summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2013-04-29Merge tag 'char-misc-3.10-rc1' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver update from Greg Kroah-Hartman: "Here's the big char / misc driver update for 3.10-rc1 A number of various driver updates, the majority being new functionality in the MEI driver subsystem (it's now a subsystem, it started out just a single driver), extcon updates, memory updates, hyper-v updates, and a bunch of other small stuff that doesn't fit in any other tree. All of these have been in linux-next for a while" * tag 'char-misc-3.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (148 commits) Tools: hv: Fix a checkpatch warning tools: hv: skip iso9660 mounts in hv_vss_daemon tools: hv: use FIFREEZE/FITHAW in hv_vss_daemon tools: hv: use getmntent in hv_vss_daemon Tools: hv: Fix a checkpatch warning tools: hv: fix checks for origin of netlink message in hv_vss_daemon Tools: hv: fix warnings in hv_vss_daemon misc: mark spear13xx-pcie-gadget as broken mei: fix krealloc() misuse in in mei_cl_irq_read_msg() mei: reduce flow control only for completed messages mei: reseting -> resetting mei: fix reading large reposnes mei: revamp mei_irq_read_client_message function mei: revamp mei_amthif_irq_read_message mei: revamp hbm state machine Revert "drivers/scsi: use module_pcmcia_driver() in pcmcia drivers" Revert "scsi: pcmcia: nsp_cs: remove module init/exit function prototypes" scsi: pcmcia: nsp_cs: remove module init/exit function prototypes mei: wd: fix line over 80 characters misc: tsl2550: Use dev_pm_ops ...
2013-04-29mm: Convert print_symbol to %pSRJoe Perches2-9/+7
Use the new vsprintf extension to avoid any possible message interleaving. Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-04-27vm: add no-mmu vm_iomap_memory() stubLinus Torvalds1-0/+10
I think we could just move the full vm_iomap_memory() function into util.h or similar, but I didn't get any reply from anybody actually using nommu even to this trivial patch, so I'm not going to touch it any more than required. Here's the fairly minimal stub to make the nommu case at least potentially work. It doesn't seem like anybody cares, though. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-17mm/vmscan: fix error return in kswapd_run()Xishi Qiu1-1/+1
Fix the error return value in kswapd_run(). The bug was introduced by commit d5dc0ad928fb ("mm/vmscan: fix error number for failed kthread"). Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Rik van Riel <riel@redhat.com> Reported-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-17hugetlbfs: add swap entry check in follow_hugetlb_page()Naoya Horiguchi1-1/+11
With applying the previous patch "hugetlbfs: stop setting VM_DONTDUMP in initializing vma(VM_HUGETLB)" to reenable hugepage coredump, if a memory error happens on a hugepage and the affected processes try to access the error hugepage, we hit VM_BUG_ON(atomic_read(&page->_count) <= 0) in get_page(). The reason for this bug is that coredump-related code doesn't recognise "hugepage hwpoison entry" with which a pmd entry is replaced when a memory error occurs on a hugepage. In other words, physical address information is stored in different bit layout between hugepage hwpoison entry and pmd entry, so follow_hugetlb_page() which is called in get_dump_page() returns a wrong page from a given address. The expected behavior is like this: absent is_swap_pte FOLL_DUMP Expected behavior ------------------------------------------------------------------- true false false hugetlb_fault false true false hugetlb_fault false false false return page true false true skip page (to avoid allocation) false true true hugetlb_fault false false true return page With this patch, we can call hugetlb_fault() and take proper actions (we wait for migration entries, fail with VM_FAULT_HWPOISON_LARGE for hwpoisoned entries,) and as the result we can dump all hugepages except for hwpoisoned ones. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [2.6.34+?] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-16vm: add vm_iomap_memory() helper functionLinus Torvalds1-0/+47
Various drivers end up replicating the code to mmap() their memory buffers into user space, and our core memory remapping function may be very flexible but it is unnecessarily complicated for the common cases to use. Our internal VM uses pfn's ("page frame numbers") which simplifies things for the VM, and allows us to pass physical addresses around in a denser and more efficient format than passing a "phys_addr_t" around, and having to shift it up and down by the page size. But it just means that drivers end up doing that shifting instead at the interface level. It also means that drivers end up mucking around with internal VM things like the vma details (vm_pgoff, vm_start/end) way more than they really need to. So this just exports a function to map a certain physical memory range into user space (using a phys_addr_t based interface that is much more natural for a driver) and hides all the complexity from the driver. Some drivers will still end up tweaking the vm_page_prot details for things like prefetching or cacheability etc, but that's actually relevant to the driver, rather than caring about what the page offset of the mapping is into the particular IO memory region. Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-15memcg: force use_hierarchy if sane_behaviorTejun Heo1-0/+17
Turn on use_hierarchy by default if sane_behavior is specified and don't create .use_hierarchy file. It is debatable whether to remove .use_hierarchy file or make it ro as the former could make transition easier in certain cases; however, the behavior changes which will be gated by sane_behavior are intensive including changing basic meaning of certain control knobs in a few controllers and I don't really think keeping this piece would make things easier in any noticeable way, so let's remove it. v2: Explain that mem_cgroup_bind() doesn't have to worry about children as suggested by Michal Hocko. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
2013-04-14Merge 3.9-rc7 into char-misc-nextGreg Kroah-Hartman3-2/+3
We want the fixes in there. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-12x86-32: Fix possible incomplete TLB invalidate with PAE pagetablesDave Hansen1-0/+1
This patch attempts to fix: https://bugzilla.kernel.org/show_bug.cgi?id=56461 The symptom is a crash and messages like this: chrome: Corrupted page table at address 34a03000 *pdpt = 0000000000000000 *pde = 0000000000000000 Bad pagetable: 000f [#1] PREEMPT SMP Ingo guesses this got introduced by commit 611ae8e3f520 ("x86/tlb: enable tlb flush range support for x86") since that code started to free unused pagetables. On x86-32 PAE kernels, that new code has the potential to free an entire PMD page and will clear one of the four page-directory-pointer-table (aka pgd_t entries). The hardware aggressively "caches" these top-level entries and invlpg does not actually affect the CPU's copy. If we clear one we *HAVE* to do a full TLB flush, otherwise we might continue using a freed pmd page. (note, we do this properly on the population side in pud_populate()). This patch tracks whenever we clear one of these entries in the 'struct mmu_gather', and ensures that we follow up with a full tlb flush. BTW, I disassembled and checked that: if (tlb->fullmm == 0) and if (!tlb->fullmm && !tlb->need_flush_all) generate essentially the same code, so there should be zero impact there to the !PAE case. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Artem S Tashkinov <t.artem@mailcity.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-09lift sb_start_write() out of ->write()Al Viro1-3/+0
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09lift sb_start_write/sb_end_write out of ->aio_write()Al Viro1-2/+0
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-07memcg: fix memcg_cache_name() to use cgroup_name()Michal Hocko1-31/+32
As cgroup supports rename, it's unsafe to dereference dentry->d_name without proper vfs locks. Fix this by using cgroup_name() rather than dentry directly. Also open code memcg_cache_name because it is called only from kmem_cache_dup which frees the returned name right after kmem_cache_create_memcg makes a copy of it. Such a short-lived allocation doesn't make too much sense. So replace it by a static buffer as kmem_cache_dup is called with memcg_cache_mutex. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-05slub: tid must be retrieved from the percpu area of the current processorChristoph Lameter1-3/+9
As Steven Rostedt has pointer out: rescheduling could occur on a different processor after the determination of the per cpu pointer and before the tid is retrieved. This could result in allocation from the wrong node in slab_alloc(). The effect is much more severe in slab_free() where we could free to the freelist of the wrong page. The window for something like that occurring is pretty small but it is possible. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-04-05slub: Do not dereference NULL pointer in node_matchChristoph Lameter1-1/+1
The variables accessed in slab_alloc are volatile and therefore the page pointer passed to node_match can be NULL. The processing of data in slab_alloc is tentative until either the cmpxhchg succeeds or the __slab_alloc slowpath is invoked. Both are able to perform the same allocation from the freelist. Check for the NULL pointer in node_match. A false positive will lead to a retry of the loop in __slab_alloc. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-04-04mm: prevent mmap_cache race in find_vma()Jan Stancek2-2/+2
find_vma() can be called by multiple threads with read lock held on mm->mmap_sem and any of them can update mm->mmap_cache. Prevent compiler from re-fetching mm->mmap_cache, because other readers could update it in the meantime: thread 1 thread 2 | find_vma() | find_vma() struct vm_area_struct *vma = NULL; | vma = mm->mmap_cache; | if (!(vma && vma->vm_end > addr | && vma->vm_start <= addr)) { | | mm->mmap_cache = vma; return vma; | ^^ compiler may optimize this | local variable out and re-read | mm->mmap_cache | This issue can be reproduced with gcc-4.8.0-1 on s390x by running mallocstress testcase from LTP, which triggers: kernel BUG at mm/rmap.c:1088! Call Trace: ([<000003d100c57000>] 0x3d100c57000) [<000000000023a1c0>] do_wp_page+0x2fc/0xa88 [<000000000023baae>] handle_pte_fault+0x41a/0xac8 [<000000000023d832>] handle_mm_fault+0x17a/0x268 [<000000000060507a>] do_protection_exception+0x1e2/0x394 [<0000000000603a04>] pgm_check_handler+0x138/0x13c [<000003fffcf1f07a>] 0x3fffcf1f07a Last Breaking-Event-Address: [<000000000024755e>] page_add_new_anon_rmap+0xc2/0x168 Thanks to Jakub Jelinek for his insight on gcc and helping to track this down. Signed-off-by: Jan Stancek <jstancek@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-02Merge branch 'writeback-workqueue' of ↵Jens Axboe6-250/+50
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq into for-3.10/core Tejun writes: ----- This is the pull request for the earlier patchset[1] with the same name. It's only three patches (the first one was committed to workqueue tree) but the merge strategy is a bit involved due to the dependencies. * Because the conversion needs features from wq/for-3.10, block/for-3.10/core is based on rc3, and wq/for-3.10 has conflicts with rc3, I pulled mainline (rc5) into wq/for-3.10 to prevent those workqueue conflicts from flaring up in block tree. * Resolving the issue that Jan and Dave raised about debugging requires arch-wide changes. The patchset is being worked on[2] but it'll have to go through -mm after these changes show up in -next, and not included in this pull request. The three commits are located in the following git branch. git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git writeback-workqueue Pulling it into block/for-3.10/core produces a conflict in drivers/md/raid5.c between the following two commits. e3620a3ad5 ("MD RAID5: Avoid accessing gendisk or queue structs when not available") 2f6db2a707 ("raid5: use bio_reset()") The conflict is trivial - one removes an "if ()" conditional while the other removes "rbi->bi_next = NULL" right above it. We just need to remove both. The merged branch is available at git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git block-test-merge so that you can use it for verification. The test merge commit has proper merge description. While these changes are a bit of pain to route, they make code simpler and even have, while minute, measureable performance gain[3] even on a workload which isn't particularly favorable to showing the benefits of this conversion. ---- Fixed up the conflict. Conflicts: drivers/md/raid5.c Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-04-02slub: add 'likely' macro to inc_slabs_node()Joonsoo Kim1-1/+1
After boot phase, 'n' always exist. So add 'likely' macro for helping compiler. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-04-02slub: correct to calculate num of acquired objects in get_partial_node()Joonsoo Kim1-8/+9
There is a subtle bug when calculating a number of acquired objects. Currently, we calculate "available = page->objects - page->inuse", after acquire_slab() is called in get_partial_node(). In acquire_slab() with mode = 1, we always set new.inuse = page->objects. So, acquire_slab(s, n, page, object == NULL); if (!object) { c->page = page; stat(s, ALLOC_FROM_PARTIAL); object = t; available = page->objects - page->inuse; !!! availabe is always 0 !!! ... Therfore, "available > s->cpu_partial / 2" is always false and we always go to second iteration. This patch correct this problem. After that, we don't need return value of put_cpu_partial(). So remove it. Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-04-01writeback: expose the bdi_wq workqueueTejun Heo1-1/+1
There are cases where userland wants to tweak the priority and affinity of writeback flushers. Expose bdi_wq to userland by setting WQ_SYSFS. It appears under /sys/bus/workqueue/devices/writeback/ and allows adjusting maximum concurrency level, cpumask and nice level. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-01writeback: replace custom worker pool implementation with unbound workqueueTejun Heo1-227/+28
Writeback implements its own worker pool - each bdi can be associated with a worker thread which is created and destroyed dynamically. The worker thread for the default bdi is always present and serves as the "forker" thread which forks off worker threads for other bdis. there's no reason for writeback to implement its own worker pool when using unbound workqueue instead is much simpler and more efficient. This patch replaces custom worker pool implementation in writeback with an unbound workqueue. The conversion isn't too complicated but the followings are worth mentioning. * bdi_writeback->last_active, task and wakeup_timer are removed. delayed_work ->dwork is added instead. Explicit timer handling is no longer necessary. Everything works by either queueing / modding / flushing / canceling the delayed_work item. * bdi_writeback_thread() becomes bdi_writeback_workfn() which runs off bdi_writeback->dwork. On each execution, it processes bdi->work_list and reschedules itself if there are more things to do. The function also handles low-mem condition, which used to be handled by the forker thread. If the function is running off a rescuer thread, it only writes out limited number of pages so that the rescuer can serve other bdis too. This preserves the flusher creation failure behavior of the forker thread. * INIT_LIST_HEAD(&bdi->bdi_list) is used to tell bdi_writeback_workfn() about on-going bdi unregistration so that it always drains work_list even if it's running off the rescuer. Note that the original code was broken in this regard. Under memory pressure, a bdi could finish unregistration with non-empty work_list. * The default bdi is no longer special. It now is treated the same as any other bdi and bdi_cap_flush_forker() is removed. * BDI_pending is no longer used. Removed. * Some tracepoints become non-applicable. The following TPs are removed - writeback_nothread, writeback_wake_thread, writeback_wake_forker_thread, writeback_thread_start, writeback_thread_stop. Everything, including devices coming and going away and rescuer operation under simulated memory pressure, seems to work fine in my test setup. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Jeff Moyer <jmoyer@redhat.com>
2013-04-01writeback: remove unused bdi_pending_listTejun Heo1-3/+1
There's no user left. Remove it. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Fengguang Wu <fengguang.wu@intel.com>
2013-04-01Merge v3.9-rc5 into char-misc-nextGreg Kroah-Hartman3-17/+10
This picks up the fixes in 3.9-rc5 that we need here. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-29mm: export split_page()K. Y. Srinivasan1-0/+1
This symbol will be used in the Hyper-V balloon driver to support 2M allocations. Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-28Revert "mm: introduce VM_POPULATE flag to better deal with racy userspace ↵Michel Lespinasse3-17/+10
programs" This reverts commit 186930500985 ("mm: introduce VM_POPULATE flag to better deal with racy userspace programs"). VM_POPULATE only has any effect when userspace plays racy games with vmas by trying to unmap and remap memory regions that mmap or mlock are operating on. Also, the only effect of VM_POPULATE when userspace plays such games is that it avoids populating new memory regions that get remapped into the address range that was being operated on by the original mmap or mlock calls. Let's remove VM_POPULATE as there isn't any strong argument to mandate a new vm_flag. Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-23block: Convert some code to bio_for_each_segment_all()Kent Overstreet1-1/+1
More prep work for immutable bvecs: A few places in the code were either open coding or using the wrong version - fix. After we introduce the bvec iter, it'll no longer be possible to modify the biovec through bio_for_each_segment_all() - it doesn't increment a pointer to the current bvec, you pass in a struct bio_vec (not a pointer) which is updated with what the current biovec would be (taking into account bi_bvec_done and bi_size). So because of that it's more worthwhile to be consistent about bio_for_each_segment()/bio_for_each_segment_all() usage. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: NeilBrown <neilb@suse.de> CC: Alasdair Kergon <agk@redhat.com> CC: dm-devel@redhat.com CC: Alexander Viro <viro@zeniv.linux.org.uk>
2013-03-23block: Add bio_for_each_segment_all()Kent Overstreet1-1/+1
__bio_for_each_segment() iterates bvecs from the specified index instead of bio->bv_idx. Currently, the only usage is to walk all the bvecs after the bio has been advanced by specifying 0 index. For immutable bvecs, we need to split these apart; bio_for_each_segment() is going to have a different implementation. This will also help document the intent of code that's using it - bio_for_each_segment_all() is only legal to use for code that owns the bio. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: Neil Brown <neilb@suse.de> CC: Boaz Harrosh <bharrosh@panasas.com>
2013-03-23bounce: Refactor __blk_queue_bounce to not use bi_io_vecKent Overstreet1-54/+19
A bunch of what __blk_queue_bounce() was doing was problematic for the immutable bvec work; this cleans that up and the code is quite a bit smaller, too. The __bio_for_each_segment() in copy_to_high_bio_irq() was changed because that one's looping over the original bio, not the bounce bio - a later patch renames __bio_for_each_segment() -> bio_for_each_segment_all(), and documents that bio_for_each_segment_all() is only for code that owns the bio. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk>
2013-03-23block: Remove bi_idx referencesKent Overstreet1-1/+0
For immutable bvecs, all bi_idx usage needs to be audited - so here we're removing all the unnecessary uses. Most of these are places where it was being initialized on a bio that was just allocated, a few others are conversions to standard macros. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk>
2013-03-22mm/hotplug: only free wait_table if it's allocated by vmallocJianguo Wu1-1/+5
zone->wait_table may be allocated from bootmem, it can not be freed. Signed-off-by: Jianguo Wu <wujianguo@huawei.com> Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-22mm/hugetlb: fix total hugetlbfs pages count when using memory overcommit ↵Wanpeng Li1-2/+6
accouting hugetlb_total_pages is used for overcommit calculations but the current implementation considers only the default hugetlb page size (which is either the first defined hugepage size or the one specified by default_hugepagesz kernel boot parameter). If the system is configured for more than one hugepage size, which is possible since commit a137e1cc6d6e ("hugetlbfs: per mount huge page sizes") then the overcommit estimation done by __vm_enough_memory() (resp. shown by meminfo_proc_show) is not precise - there is an impression of more available/allowed memory. This can lead to an unexpected ENOMEM/EFAULT resp. SIGSEGV when memory is accounted. Testcase: boot: hugepagesz=1G hugepages=1 the default overcommit ratio is 50 before patch: egrep 'CommitLimit' /proc/meminfo CommitLimit: 55434168 kB after patch: egrep 'CommitLimit' /proc/meminfo CommitLimit: 54909880 kB [akpm@linux-foundation.org: coding-style tweak] Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: <stable@vger.kernel.org> [3.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-18Merge branch 'master' into for-nextJiri Kosina47-1481/+2848
Sync with Linus' tree to be able to apply patch to the newly added ITG-3200 driver.
2013-03-14mm/fremap.c: fix possible oops on error pathMichel Lespinasse1-3/+2
The vm_flags introduced in 6d7825b10dbe ("mm/fremap.c: fix oops on error path") is supposed to avoid a compiler warning about unitialized vm_flags without changing the generated code. However I am concerned that this is going to be very brittle, and fail with some compiler versions. The failure could be either of: - compiler could actually load vma->vm_flags before checking for the !vma condition, thus reintroducing the oops - compiler could optimize out the !vma check, since the pointer just got dereferenced shortly before (so the compiler knows it can't be NULL!) I propose reversing this part of the change and initializing vm_flags to 0 just to avoid the bogus uninitialized use warning. Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Tommi Rantala <tt.rantala@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-13mm/fremap.c: fix oops on error pathAndrew Morton1-2/+4
If find_vma() fails, sys_remap_file_pages() will dereference `vma', which contains NULL. Fix it by checking the pointer. (We could alternatively check for err==0, but this seems more direct) (The vm_flags change is to squish a bogus used-uninitialised warning without adding extra code). Reported-by: Tommi Rantala <tt.rantala@gmail.com> Cc: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-13mm: remove_memory(): fix end_pfn settingToshi Kani1-1/+1
remove_memory() calls walk_memory_range() with [start_pfn, end_pfn), where end_pfn is exclusive in this range. Therefore, end_pfn needs to be set to the next page of the end address. Signed-off-by: Toshi Kani <toshi.kani@hp.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-12Select VIRT_TO_BUS directly where neededStephen Rothwell1-2/+6
In commit 887cbce0adea ("arch Kconfig: centralise ARCH_NO_VIRT_TO_BUS") I introduced the config sybmol HAVE_VIRT_TO_BUS and selected that where needed. I am not sure what I was thinking. Instead, just directly select VIRT_TO_BUS where it is needed. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-12Fix: compat_rw_copy_check_uvector() misuse in aio, readv, writev, and ↵Mathieu Desnoyers1-8/+0
security keys Looking at mm/process_vm_access.c:process_vm_rw() and comparing it to compat_process_vm_rw() shows that the compatibility code requires an explicit "access_ok()" check before calling compat_rw_copy_check_uvector(). The same difference seems to appear when we compare fs/read_write.c:do_readv_writev() to fs/compat.c:compat_do_readv_writev(). This subtle difference between the compat and non-compat requirements should probably be debated, as it seems to be error-prone. In fact, there are two others sites that use this function in the Linux kernel, and they both seem to get it wrong: Now shifting our attention to fs/aio.c, we see that aio_setup_iocb() also ends up calling compat_rw_copy_check_uvector() through aio_setup_vectored_rw(). Unfortunately, the access_ok() check appears to be missing. Same situation for security/keys/compat.c:compat_keyctl_instantiate_key_iov(). I propose that we add the access_ok() check directly into compat_rw_copy_check_uvector(), so callers don't have to worry about it, and it therefore makes the compat call code similar to its non-compat counterpart. Place the access_ok() check in the same location where copy_from_user() can trigger a -EFAULT error in the non-compat code, so the ABI behaviors are alike on both compat and non-compat. While we are here, fix compat_do_readv_writev() so it checks for compat_rw_copy_check_uvector() negative return values. And also, fix a memory leak in compat_keyctl_instantiate_key_iov() error handling. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-10make vfree() safe to call from interrupt contextsAl Viro1-5/+40
A bunch of RCU callbacks want to be able to do vfree() and end up with rather kludgy schemes. Just let vfree() do the right thing - put the victim on llist and schedule actual __vunmap() via schedule_work(), so that it runs from non-interrupt context. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-03-08memcg: initialize kmem-cache destroying work earlierKonstantin Khlebnikov1-2/+6
Fix a warning from lockdep caused by calling cancel_work_sync() for uninitialized struct work. This path has been triggered by destructon kmem-cache hierarchy via destroying its root kmem-cache. cache ffff88003c072d80 obj ffff88003b410000 cache ffff88003c072d80 obj ffff88003b924000 cache ffff88003c20bd40 INFO: trying to register non-static key. the code is fine but needs lockdep annotation. turning off the locking correctness validator. Pid: 2825, comm: insmod Tainted: G O 3.9.0-rc1-next-20130307+ #611 Call Trace: __lock_acquire+0x16a2/0x1cb0 lock_acquire+0x8a/0x120 flush_work+0x38/0x2a0 __cancel_work_timer+0x89/0xf0 cancel_work_sync+0xb/0x10 kmem_cache_destroy_memcg_children+0x81/0xb0 kmem_cache_destroy+0xf/0xe0 init_module+0xcb/0x1000 [kmem_test] do_one_initcall+0x11a/0x170 load_module+0x19b0/0x2320 SyS_init_module+0xc6/0xf0 system_call_fastpath+0x16/0x1b Example module to demonstrate: #include <linux/module.h> #include <linux/slab.h> #include <linux/mm.h> #include <linux/workqueue.h> int __init mod_init(void) { int size = 256; struct kmem_cache *cache; void *obj; struct page *page; cache = kmem_cache_create("kmem_cache_test", size, size, 0, NULL); if (!cache) return -ENOMEM; printk("cache %p\n", cache); obj = kmem_cache_alloc(cache, GFP_KERNEL); if (obj) { page = virt_to_head_page(obj); printk("obj %p cache %p\n", obj, page->slab_cache); kmem_cache_free(cache, obj); } flush_scheduled_work(); obj = kmem_cache_alloc(cache, GFP_KERNEL); if (obj) { page = virt_to_head_page(obj); printk("obj %p cache %p\n", obj, page->slab_cache); kmem_cache_free(cache, obj); } kmem_cache_destroy(cache); return -EBUSY; } module_init(mod_init); MODULE_LICENSE("GPL"); Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Glauber Costa <glommer@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-08ksm: fix m68k build: only NUMA needs pfn_to_nidHugh Dickins1-1/+1
A CONFIG_DISCONTIGMEM=y m68k config gave mm/ksm.c: In function `get_kpfn_nid': mm/ksm.c:492: error: implicit declaration of function `pfn_to_nid' linux/mmzone.h declares it for CONFIG_SPARSEMEM and CONFIG_FLATMEM, but expects the arch's asm/mmzone.h to declare it for CONFIG_DISCONTIGMEM (see arch/mips/include/asm/mmzone.h for example). Or perhaps it is only expected when CONFIG_NUMA=y: too much of a maze, and m68k got away without it so far, so fix the build in mm/ksm.c. Signed-off-by: Hugh Dickins <hughd@google.com> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Petr Holasek <pholasek@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-08mm/mempolicy.c: fix sp_node_init() argument orderingKOSAKI Motohiro1-1/+1
Currently, n_new is wrongly initialized. start and end parameter are inverted. Let's fix it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Dave Jones <davej@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-08mm/mempolicy.c: fix wrong sp_node insertionHillf Danton1-1/+1
n->end is accessed in sp_insert(). Thus it should be update before calling sp_insert(). This mistake may make kernel panic. Signed-off-by: Hillf Danton <dhillf@gmail.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Dave Jones <davej@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-07hugetlb: fix sparse warning for hugetlb_register_nodeClaudiu Ghioc1-2/+2
Removed the following sparse warnings: * mm/hugetlb.c:1764:6: warning: symbol 'hugetlb_unregister_node' was not declared. Should it be static? * mm/hugetlb.c:1808:6: warning: symbol 'hugetlb_register_node' was not declared. Should it be static? Signed-off-by: Claudiu Ghioc <claudiu.ghioc@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-03-04make do_mremap() staticAl Viro1-2/+1
The extern in sys_sparc_64.c was a rudiment of time when do_mremap() used to exist in MMU case (it doesn't anymore). As for !MMU one, nothing uses it outside of mm/nommu.c... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-03-03teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long longAl Viro2-24/+3
... and convert a bunch of SYSCALL_DEFINE ones to SYSCALL_DEFINE<n>, killing the boilerplate crap around them. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-03-03Merge branch 'for-linus' of ↵Linus Torvalds1-3/+2
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more VFS bits from Al Viro: "Unfortunately, it looks like xattr series will have to wait until the next cycle ;-/ This pile contains 9p cleanups and fixes (races in v9fs_fid_add() etc), fixup for nommu breakage in shmem.c, several cleanups and a bit more file_inode() work" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: constify path_get/path_put and fs_struct.c stuff fix nommu breakage in shmem.c cache the value of file_inode() in struct file 9p: if v9fs_fid_lookup() gets to asking server, it'd better have hashed dentry 9p: make sure ->lookup() adds fid to the right dentry 9p: untangle ->lookup() a bit 9p: double iput() in ->lookup() if d_materialise_unique() fails 9p: v9fs_fid_add() can't fail now v9fs: get rid of v9fs_dentry 9p: turn fid->dlist into hlist 9p: don't bother with private lock in ->d_fsdata; dentry->d_lock will do just fine more file_inode() open-coded instances selinux: opened file can't have NULL or negative ->f_path.dentry (In the meantime, the hlist traversal macros have changed, so this required a semantic conflict fixup for the newly hlistified fid->dlist)
2013-03-02x86, ACPI, mm: Revert movablemem_map supportYinghai Lu2-330/+5
Tim found: WARNING: at arch/x86/kernel/smpboot.c:324 topology_sane.isra.2+0x6f/0x80() Hardware name: S2600CP sched: CPU #1's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency. smpboot: Booting Node 1, Processors #1 Modules linked in: Pid: 0, comm: swapper/1 Not tainted 3.9.0-0-generic #1 Call Trace: set_cpu_sibling_map+0x279/0x449 start_secondary+0x11d/0x1e5 Don Morris reproduced on a HP z620 workstation, and bisected it to commit e8d195525809 ("acpi, memory-hotplug: parse SRAT before memblock is ready") It turns out movable_map has some problems, and it breaks several things 1. numa_init is called several times, NOT just for srat. so those nodes_clear(numa_nodes_parsed) memset(&numa_meminfo, 0, sizeof(numa_meminfo)) can not be just removed. Need to consider sequence is: numaq, srat, amd, dummy. and make fall back path working. 2. simply split acpi_numa_init to early_parse_srat. a. that early_parse_srat is NOT called for ia64, so you break ia64. b. for (i = 0; i < MAX_LOCAL_APIC; i++) set_apicid_to_node(i, NUMA_NO_NODE) still left in numa_init. So it will just clear result from early_parse_srat. it should be moved before that.... c. it breaks ACPI_TABLE_OVERIDE...as the acpi table scan is moved early before override from INITRD is settled. 3. that patch TITLE is total misleading, there is NO x86 in the title, but it changes critical x86 code. It caused x86 guys did not pay attention to find the problem early. Those patches really should be routed via tip/x86/mm. 4. after that commit, following range can not use movable ram: a. real_mode code.... well..funny, legacy Node0 [0,1M) could be hot-removed? b. initrd... it will be freed after booting, so it could be on movable... c. crashkernel for kdump...: looks like we can not put kdump kernel above 4G anymore. d. init_mem_mapping: can not put page table high anymore. e. initmem_init: vmemmap can not be high local node anymore. That is not good. If node is hotplugable, the mem related range like page table and vmemmap could be on the that node without problem and should be on that node. We have workaround patch that could fix some problems, but some can not be fixed. So just remove that offending commit and related ones including: f7210e6c4ac7 ("mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect movablecore_map in memblock_overlaps_region().") 01a178a94e8e ("acpi, memory-hotplug: support getting hotplug info from SRAT") 27168d38fa20 ("acpi, memory-hotplug: extend movablemem_map ranges to the end of node") e8d195525809 ("acpi, memory-hotplug: parse SRAT before memblock is ready") fb06bc8e5f42 ("page_alloc: bootmem limit with movablecore_map") 42f47e27e761 ("page_alloc: make movablemem_map have higher priority") 6981ec31146c ("page_alloc: introduce zone_movable_limit[] to keep movable limit for nodes") 34b71f1e04fc ("page_alloc: add movable_memmap kernel parameter") 4d59a75125d5 ("x86: get pg_data_t's memory from other node") Later we should have patches that will make sure kernel put page table and vmemmap on local node ram instead of push them down to node0. Also need to find way to put other kernel used ram to local node ram. Reported-by: Tim Gardner <tim.gardner@canonical.com> Reported-by: Don Morris <don.morris@hp.com> Bisected-by: Don Morris <don.morris@hp.com> Tested-by: Don Morris <don.morris@hp.com> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Thomas Renninger <trenn@suse.de> Cc: Tejun Heo <tj@kernel.org> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-01fix nommu breakage in shmem.cAl Viro1-3/+2
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-28Merge tag 'writeback-fixes' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux Pull writeback fixes from Wu Fengguang: "Two writeback fixes - fix negative (setpoint - dirty) in 32bit archs - use down_read_trylock() in writeback_inodes_sb(_nr)_if_idle()" * tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux: Negative (setpoint-dirty) in bdi_position_ratio() vfs: re-implement writeback_inodes_sb(_nr)_if_idle() and rename them
2013-02-28Merge branch 'for-3.9/core' of git://git.kernel.dk/linux-blockLinus Torvalds1-0/+2
Pull block IO core bits from Jens Axboe: "Below are the core block IO bits for 3.9. It was delayed a few days since my workstation kept crashing every 2-8h after pulling it into current -git, but turns out it is a bug in the new pstate code (divide by zero, will report separately). In any case, it contains: - The big cfq/blkcg update from Tejun and and Vivek. - Additional block and writeback tracepoints from Tejun. - Improvement of the should sort (based on queues) logic in the plug flushing. - _io() variants of the wait_for_completion() interface, using io_schedule() instead of schedule() to contribute to io wait properly. - Various little fixes. You'll get two trivial merge conflicts, which should be easy enough to fix up" Fix up the trivial conflicts due to hlist traversal cleanups (commit b67bfe0d42ca: "hlist: drop the node parameter from iterators"). * 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits) block: remove redundant check to bd_openers() block: use i_size_write() in bd_set_size() cfq: fix lock imbalance with failed allocations drivers/block/swim3.c: fix null pointer dereference block: don't select PERCPU_RWSEM block: account iowait time when waiting for completion of IO request sched: add wait_for_completion_io[_timeout] writeback: add more tracepoints block: add block_{touch|dirty}_buffer tracepoint buffer: make touch_buffer() an exported function block: add @req to bio_{front|back}_merge tracepoints block: add missing block_bio_complete() tracepoint block: Remove should_sort judgement when flush blk_plug block,elevator: use new hashtable implementation cfq-iosched: add hierarchical cfq_group statistics cfq-iosched: collect stats from dead cfqgs cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats() blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock block: RCU free request_queue blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge() ...
2013-02-28slub: correctly bootstrap boot cachesGlauber Costa1-0/+6
After we create a boot cache, we may allocate from it until it is bootstraped. This will move the page from the partial list to the cpu slab list. If this happens, the loop: list_for_each_entry(p, &n->partial, lru) that we use to scan for all partial pages will yield nothing, and the pages will keep pointing to the boot cpu cache, which is of course, invalid. To do that, we should flush the cache to make sure that the cpu slab is back to the partial list. Signed-off-by: Glauber Costa <glommer@parallels.com> Reported-by: Steffen Michalke <StMichalke@web.de> Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Pekka Enberg <penberg@kernel.org>