From f96c48670319d685d18d50819ed0c1ef751ed2ac Mon Sep 17 00:00:00 2001
From: Suren Baghdasaryan
Date: Wed, 5 Jul 2023 18:14:00 -0700
Subject: mm: disable CONFIG_PER_VMA_LOCK until it's fixed
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

A memory corruption was reported in [1], with the bisection pointing to
patch [2], which enabled per-VMA locks for x86.  Disable the per-VMA
locks config to prevent this issue until the fix is confirmed.  This is
expected to be a temporary measure.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=217624
[2] https://lore.kernel.org/all/20230227173632.3292573-30-surenb@google.com

Link: https://lkml.kernel.org/r/20230706011400.2949242-3-surenb@google.com
Reported-by: Jiri Slaby
Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
Reported-by: Jacob Young
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
Signed-off-by: Suren Baghdasaryan
Cc: David Hildenbrand
Cc: Holger Hoffstätte
Cc:
Signed-off-by: Andrew Morton
---
 mm/Kconfig | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

(limited to 'mm')

diff --git a/mm/Kconfig b/mm/Kconfig
index 09130434e30d..0abc6c71dd89 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1224,8 +1224,9 @@ config ARCH_SUPPORTS_PER_VMA_LOCK
 	def_bool n
 
 config PER_VMA_LOCK
-	def_bool y
+	bool "Enable per-vma locking during page fault handling."
 	depends on ARCH_SUPPORTS_PER_VMA_LOCK && MMU && SMP
+	depends on BROKEN
 	help
 	  Allow per-vma locking during page fault handling.
 
-- cgit v1.2.3

From 191fcdb6c9cf8b738b1628cbcf3af63d545c825c Mon Sep 17 00:00:00 2001
From: John Hubbard
Date: Fri, 30 Jun 2023 18:04:42 -0700
Subject: mm/hugetlb.c: fix a bug within a BUG(): inconsistent pte comparison

The following crash happens for me when running the -mm selftests
(below).  Specifically, it happens while running the uffd-stress
subtests:

kernel BUG at mm/hugetlb.c:7249!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 3238 Comm: uffd-stress Not tainted 6.4.0-hubbard-github+ #109
Hardware name: ASUS X299-A/PRIME X299-A, BIOS 1503 08/03/2018
RIP: 0010:huge_pte_alloc+0x12c/0x1a0
...
Call Trace:
 ? __die_body+0x63/0xb0
 ? die+0x9f/0xc0
 ? do_trap+0xab/0x180
 ? huge_pte_alloc+0x12c/0x1a0
 ? do_error_trap+0xc6/0x110
 ? huge_pte_alloc+0x12c/0x1a0
 ? handle_invalid_op+0x2c/0x40
 ? huge_pte_alloc+0x12c/0x1a0
 ? exc_invalid_op+0x33/0x50
 ? asm_exc_invalid_op+0x16/0x20
 ? __pfx_put_prev_task_idle+0x10/0x10
 ? huge_pte_alloc+0x12c/0x1a0
 hugetlb_fault+0x1a3/0x1120
 ? finish_task_switch+0xb3/0x2a0
 ? lock_is_held_type+0xdb/0x150
 handle_mm_fault+0xb8a/0xd40
 ? find_vma+0x5d/0xa0
 do_user_addr_fault+0x257/0x5d0
 exc_page_fault+0x7b/0x1f0
 asm_exc_page_fault+0x22/0x30

That happens because a BUG() statement in huge_pte_alloc() attempts to
check that a pte, if present, is a hugetlb pte, but it does so in a
non-lockless-safe manner that leads to a false BUG() report.

We got here due to a couple of bugs, each of which by itself was not
quite enough to cause a problem:

First of all, before commit c33c794828f2 ("mm: ptep_get() conversion"),
the BUG() statement in huge_pte_alloc() was itself fragile: it relied
upon compiler behavior to read the pte only once, despite using it twice
in the same conditional.

Next, commit c33c794828f2 ("mm: ptep_get() conversion") broke that
delicate situation by causing all direct pte reads to be done via
READ_ONCE().
As a result, READ_ONCE() was called twice within the same BUG()
conditional, occasionally comparing two different versions of the pte
and thus producing false BUG() reports.

Fix this by taking a single snapshot of the pte before using it in the
BUG() conditional.

Now, that commit is only partially to blame here, but people doing
bisections will invariably land there, so this will help them find a fix
for a real crash.  Also, the previous behavior was unlikely to ever
expose this bug: it was fragile, yet not actually broken.  That is why I
chose this commit for the Fixes tag, rather than the commit that created
the original BUG() statement.

Link: https://lkml.kernel.org/r/20230701010442.2041858-1-jhubbard@nvidia.com
Fixes: c33c794828f2 ("mm: ptep_get() conversion")
Signed-off-by: John Hubbard
Acked-by: James Houghton
Acked-by: Muchun Song
Reviewed-by: Ryan Roberts
Acked-by: Mike Kravetz
Cc: Adrian Hunter
Cc: Al Viro
Cc: Alex Williamson
Cc: Alexander Potapenko
Cc: Alexander Shishkin
Cc: Andrey Konovalov
Cc: Andrey Ryabinin
Cc: Christian Brauner
Cc: Christoph Hellwig
Cc: Daniel Vetter
Cc: Dave Airlie
Cc: Dimitri Sivanich
Cc: Dmitry Vyukov
Cc: Ian Rogers
Cc: Jason Gunthorpe
Cc: Jiri Olsa
Cc: Johannes Weiner
Cc: Kirill A. Shutemov
Cc: Lorenzo Stoakes
Cc: Mark Rutland
Cc: Matthew Wilcox
Cc: Miaohe Lin
Cc: Michal Hocko
Cc: Mike Rapoport (IBM)
Cc: Namhyung Kim
Cc: Naoya Horiguchi
Cc: Oleksandr Tyshchenko
Cc: Pavel Tatashin
Cc: Roman Gushchin
Cc: SeongJae Park
Cc: Shakeel Butt
Cc: Uladzislau Rezki (Sony)
Cc: Vincenzo Frascino
Cc: Yu Zhao
Signed-off-by: Andrew Morton
---
 mm/hugetlb.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

(limited to 'mm')

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bce28cca73a1..64a3239b6407 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7246,7 +7246,12 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 			pte = (pte_t *)pmd_alloc(mm, pud, addr);
 		}
 	}
-	BUG_ON(pte && pte_present(ptep_get(pte)) && !pte_huge(ptep_get(pte)));
+
+	if (pte) {
+		pte_t pteval = ptep_get_lockless(pte);
+
+		BUG_ON(pte_present(pteval) && !pte_huge(pteval));
+	}
 
 	return pte;
 }
-- cgit v1.2.3

From 6dca4ac6fc91fd41ea4d6c4511838d37f4e0eab2 Mon Sep 17 00:00:00 2001
From: Peter Collingbourne
Date: Mon, 22 May 2023 17:43:08 -0700
Subject: mm: call arch_swap_restore() from do_swap_page()

Commit c145e0b47c77 ("mm: streamline COW logic in do_swap_page()") moved
the call to swap_free() before the call to set_pte_at(), which meant that
the MTE tags could end up being freed before set_pte_at() had a chance to
restore them.  Fix it by adding a call to the arch_swap_restore() hook
before the call to swap_free().
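
The required ordering is easier to see in isolation.  The fragment below
is a heavily abbreviated sketch of the swap-in path touched by the diff
that follows: the helpers named (arch_swap_restore(), swap_free(),
set_pte_at()) are the ones used in the patch, but the surrounding
function, error handling, and locking are omitted, so treat it as an
illustration of the ordering only, not as the real do_swap_page().

/*
 * Illustration only -- not the real do_swap_page().  Per-page metadata
 * that is keyed by the swap entry (for example arm64 MTE tags) must be
 * restored while the entry is still live, i.e. before swap_free()
 * releases it and before the pte is installed.
 */
static void swapin_restore_order(struct vm_fault *vmf, swp_entry_t entry,
				 struct folio *folio, pte_t pte)
{
	arch_swap_restore(entry, folio);	/* 1: restore metadata from swap */
	swap_free(entry);			/* 2: only now drop the swap entry */
	set_pte_at(vmf->vma->vm_mm, vmf->address, vmf->pte, pte);
						/* 3: make the mapping visible */
}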
Link: https://lkml.kernel.org/r/20230523004312.1807357-2-pcc@google.com
Link: https://linux-review.googlesource.com/id/I6470efa669e8bd2f841049b8c61020c510678965
Fixes: c145e0b47c77 ("mm: streamline COW logic in do_swap_page()")
Signed-off-by: Peter Collingbourne
Reported-by: Qun-wei Lin
Closes: https://lore.kernel.org/all/5050805753ac469e8d727c797c2218a9d780d434.camel@mediatek.com/
Acked-by: David Hildenbrand
Acked-by: "Huang, Ying"
Reviewed-by: Steven Price
Acked-by: Catalin Marinas
Cc: [6.1+]
Signed-off-by: Andrew Morton
---
 mm/memory.c | 7 +++++++
 1 file changed, 7 insertions(+)

(limited to 'mm')

diff --git a/mm/memory.c b/mm/memory.c
index 0ae594703021..01f39e8144ef 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3950,6 +3950,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		}
 	}
 
+	/*
+	 * Some architectures may have to restore extra metadata to the page
+	 * when reading from swap. This metadata may be indexed by swap entry
+	 * so this must be called before swap_free().
+	 */
+	arch_swap_restore(entry, folio);
+
 	/*
 	 * Remove the swap entry and conditionally try to free up the swapcache.
 	 * We're already holding a reference on the page but haven't mapped it
-- cgit v1.2.3

From 8344a3d44be3d18671e18c4ba23bb03dd21e14ad Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)"
Date: Wed, 28 Jun 2023 19:55:48 +0100
Subject: writeback: account the number of pages written back

nr_to_write is a count of pages, so we need to decrease it by the number
of pages in the folio we just wrote, not by 1.  Most callers specify
either LONG_MAX or 1, so they are unaffected, but writeback_sb_inodes()
might end up writing 512x as many pages as it asked for.

Dave added:

: XFS is the only filesystem this would affect, right?  AFAIA, nothing
: else enables large folios and uses writeback through
: write_cache_pages() at this point...
:
: In which case, I'd be surprised if much difference, if any, gets
: noticed by anyone.

Link: https://lkml.kernel.org/r/20230628185548.981888-1-willy@infradead.org
Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
Signed-off-by: Matthew Wilcox (Oracle)
Reviewed-by: Christoph Hellwig
Cc: Jan Kara
Cc: Dave Chinner
Signed-off-by: Andrew Morton
---
 mm/page-writeback.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

(limited to 'mm')

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 1d17fb1ec863..d3f42009bb70 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2434,6 +2434,7 @@ int write_cache_pages(struct address_space *mapping,
 
 		for (i = 0; i < nr_folios; i++) {
 			struct folio *folio = fbatch.folios[i];
+			unsigned long nr;
 
 			done_index = folio->index;
 
@@ -2471,6 +2472,7 @@ continue_unlock:
 
 			trace_wbc_writepage(wbc, inode_to_bdi(mapping->host));
 			error = writepage(folio, wbc, data);
+			nr = folio_nr_pages(folio);
 			if (unlikely(error)) {
 				/*
 				 * Handle errors according to the type of
@@ -2489,8 +2491,7 @@ continue_unlock:
 					error = 0;
 				} else if (wbc->sync_mode != WB_SYNC_ALL) {
 					ret = error;
-					done_index = folio->index +
-						folio_nr_pages(folio);
+					done_index = folio->index + nr;
 					done = 1;
 					break;
 				}
@@ -2504,7 +2505,8 @@ continue_unlock:
 			 * keep going until we have written all the pages
 			 * we tagged for writeback prior to entering this loop.
 			 */
-			if (--wbc->nr_to_write <= 0 &&
+			wbc->nr_to_write -= nr;
+			if (wbc->nr_to_write <= 0 &&
 			    wbc->sync_mode == WB_SYNC_NONE) {
 				done = 1;
 				break;
-- cgit v1.2.3

From 05c56e7b4319d7f6352f27da876a1acdc8fa5cc4 Mon Sep 17 00:00:00 2001
From: Andrey Konovalov
Date: Tue, 4 Jul 2023 02:52:05 +0200
Subject: kasan: fix type cast in memory_is_poisoned_n

Commit bb6e04a173f0 ("kasan: use internal prototypes matching gcc-13
builtins") introduced a bug into the memory_is_poisoned_n implementation:
it effectively removed the cast to a signed integer type after applying
KASAN_GRANULE_MASK.

As a result, KASAN started failing to properly check memset, memcpy, and
other similar functions.

Fix the bug by adding the cast back (through an additional signed integer
variable to make the code more readable).

Link: https://lkml.kernel.org/r/8c9e0251c2b8b81016255709d4ec42942dcaf018.1688431866.git.andreyknvl@google.com
Fixes: bb6e04a173f0 ("kasan: use internal prototypes matching gcc-13 builtins")
Signed-off-by: Andrey Konovalov
Cc: Alexander Potapenko
Cc: Andrey Ryabinin
Cc: Arnd Bergmann
Cc: Dmitry Vyukov
Cc: Marco Elver
Cc:
Signed-off-by: Andrew Morton
---
 mm/kasan/generic.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

(limited to 'mm')

diff --git a/mm/kasan/generic.c b/mm/kasan/generic.c
index 5b4c97baa656..4d837ab83f08 100644
--- a/mm/kasan/generic.c
+++ b/mm/kasan/generic.c
@@ -130,9 +130,10 @@ static __always_inline bool memory_is_poisoned_n(const void *addr, size_t size)
 	if (unlikely(ret)) {
 		const void *last_byte = addr + size - 1;
 		s8 *last_shadow = (s8 *)kasan_mem_to_shadow(last_byte);
+		s8 last_accessible_byte = (unsigned long)last_byte & KASAN_GRANULE_MASK;
 
 		if (unlikely(ret != (unsigned long)last_shadow ||
-			     (((long)last_byte & KASAN_GRANULE_MASK) >= *last_shadow)))
+			     last_accessible_byte >= *last_shadow))
 			return true;
 	}
 	return false;
-- cgit v1.2.3

From fdb54d96600aafe45951f549866cd6fc1af59954 Mon Sep 17 00:00:00 2001
From: Andrey Konovalov
Date: Wed, 5 Jul 2023 14:44:02 +0200
Subject: kasan, slub: fix HW_TAGS zeroing with slub_debug

Commit 946fa0dbf2d8 ("mm/slub: extend redzone check to extra allocated
kmalloc space than requested") added precise kmalloc redzone poisoning
to the slub_debug functionality.

However, this commit didn't account for HW_TAGS KASAN fully initializing
the object via its built-in memory initialization feature.  Even though
HW_TAGS KASAN memory initialization contains special handling for when
slub_debug is enabled, it does not account for in-object slub_debug
redzones.  As a result, HW_TAGS KASAN can overwrite these redzones and
cause false-positive slub_debug reports.

To fix the issue, avoid HW_TAGS KASAN memory initialization when
slub_debug is enabled altogether.  Implement this by moving the
__slub_debug_enabled check to slab_post_alloc_hook.  Common slab code
seems like a more appropriate place for a slub_debug check anyway.
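
The granule arithmetic behind the false positives can be sketched roughly
as follows.  This is an illustration under assumed numbers (a 16-byte
granule, as with arm64 MTE, and a hypothetical kmalloc(40) in a redzoned
cache); the helper below is not kernel code, it only mirrors the rounding
at which HW_TAGS tagging and initialization operate.

/*
 * Sketch only: why HW_TAGS integrated init can clobber an in-object
 * slub_debug redzone.  The 16-byte granule matches arm64 MTE; the
 * kmalloc(40) layout is a made-up example, not taken from the patch.
 */
#define GRANULE 16

static unsigned long hwtags_init_bytes(unsigned long requested)
{
	/* HW_TAGS can only tag/zero whole granules, so init rounds up. */
	return (requested + GRANULE - 1) & ~(unsigned long)(GRANULE - 1);
}

/*
 * For kmalloc(40) in a redzoned cache, bytes 0..39 belong to the caller
 * and bytes 40.. are slub_debug redzone.  hwtags_init_bytes(40) == 48,
 * so integrated init would zero redzone bytes 40..47 and trigger a
 * false-positive report.  With the fix below, KASAN is asked not to
 * initialize at all under slub_debug, and the explicit memset() in
 * slab_post_alloc_hook() zeroes exactly zero_size bytes instead.
 */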
Link: https://lkml.kernel.org/r/678ac92ab790dba9198f9ca14f405651b97c8502.1688561016.git.andreyknvl@google.com
Fixes: 946fa0dbf2d8 ("mm/slub: extend redzone check to extra allocated kmalloc space than requested")
Signed-off-by: Andrey Konovalov
Reported-by: Will Deacon
Acked-by: Marco Elver
Cc: Mark Rutland
Cc: Alexander Potapenko
Cc: Andrey Ryabinin
Cc: Catalin Marinas
Cc: Christoph Lameter
Cc: David Rientjes
Cc: Dmitry Vyukov
Cc: Feng Tang
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Joonsoo Kim
Cc: kasan-dev@googlegroups.com
Cc: Pekka Enberg
Cc: Peter Collingbourne
Cc: Roman Gushchin
Cc: Vincenzo Frascino
Cc: Vlastimil Babka
Cc:
Signed-off-by: Andrew Morton
---
 mm/kasan/kasan.h | 12 ------------
 mm/slab.h        | 16 ++++++++++++++--
 2 files changed, 14 insertions(+), 14 deletions(-)

(limited to 'mm')

diff --git a/mm/kasan/kasan.h b/mm/kasan/kasan.h
index b799f11e45dc..2e973b36fe07 100644
--- a/mm/kasan/kasan.h
+++ b/mm/kasan/kasan.h
@@ -466,18 +466,6 @@ static inline void kasan_unpoison(const void *addr, size_t size, bool init)
 
 	if (WARN_ON((unsigned long)addr & KASAN_GRANULE_MASK))
 		return;
-	/*
-	 * Explicitly initialize the memory with the precise object size to
-	 * avoid overwriting the slab redzone. This disables initialization in
-	 * the arch code and may thus lead to performance penalty. This penalty
-	 * does not affect production builds, as slab redzones are not enabled
-	 * there.
-	 */
-	if (__slub_debug_enabled() &&
-	    init && ((unsigned long)size & KASAN_GRANULE_MASK)) {
-		init = false;
-		memzero_explicit((void *)addr, size);
-	}
 	size = round_up(size, KASAN_GRANULE_SIZE);
 
 	hw_set_mem_tag_range((void *)addr, size, tag, init);
diff --git a/mm/slab.h b/mm/slab.h
index 6a5633b25eb5..9c0e09d0f81f 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -723,6 +723,7 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s,
 					 unsigned int orig_size)
 {
 	unsigned int zero_size = s->object_size;
+	bool kasan_init = init;
 	size_t i;
 
 	flags &= gfp_allowed_mask;
@@ -739,6 +740,17 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s,
 	    (s->flags & SLAB_KMALLOC))
 		zero_size = orig_size;
 
+	/*
+	 * When slub_debug is enabled, avoid memory initialization integrated
+	 * into KASAN and instead zero out the memory via the memset below with
+	 * the proper size. Otherwise, KASAN might overwrite SLUB redzones and
+	 * cause false-positive reports. This does not lead to a performance
+	 * penalty on production builds, as slub_debug is not intended to be
+	 * enabled there.
+	 */
+	if (__slub_debug_enabled())
+		kasan_init = false;
+
 	/*
 	 * As memory initialization might be integrated into KASAN,
 	 * kasan_slab_alloc and initialization memset must be
@@ -747,8 +759,8 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s,
 	 * As p[i] might get tagged, memset and kmemleak hook come after KASAN.
 	 */
 	for (i = 0; i < size; i++) {
-		p[i] = kasan_slab_alloc(s, p[i], flags, init);
-		if (p[i] && init && !kasan_has_integrated_init())
+		p[i] = kasan_slab_alloc(s, p[i], flags, kasan_init);
+		if (p[i] && init && (!kasan_init || !kasan_has_integrated_init()))
 			memset(p[i], 0, zero_size);
 		kmemleak_alloc_recursive(p[i], s->object_size, 1,
 					 s->flags, flags);
-- cgit v1.2.3