Field | Value | Date |
---|---|---|
author | Linus Torvalds <torvalds@linux-foundation.org> | 2020-01-31 12:16:36 -0800 |
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2020-01-31 12:16:36 -0800 |
commit | 7eec11d3a784a283f916590e5aa30b855c2ccfd7 | |
tree | e1bafb0d159b787684e392ae613933f9211c7d7a | |
parent | ddaefe8947b48b638f726cf89730ecc1000ebcc3 | |
parent | 43e76af85fa7e75ac9b71fc2fcc250abb1889bff | |
Merge branch 'akpm' (patches from Andrew)
Pull updates from Andrew Morton:
"Most of -mm and quite a number of other subsystems: hotfixes, scripts,
ocfs2, misc, lib, binfmt, init, reiserfs, exec, dma-mapping, kcov.
MM is fairly quiet this time. Holidays, I assume"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits)
kcov: ignore fault-inject and stacktrace
include/linux/io-mapping.h-mapping: use PHYS_PFN() macro in io_mapping_map_atomic_wc()
execve: warn if process starts with executable stack
reiserfs: prevent NULL pointer dereference in reiserfs_insert_item()
init/main.c: fix misleading "This architecture does not have kernel memory protection" message
init/main.c: fix quoted value handling in unknown_bootoption
init/main.c: remove unnecessary repair_env_string in do_initcall_level
init/main.c: log arguments and environment passed to init
fs/binfmt_elf.c: coredump: allow process with empty address space to coredump
fs/binfmt_elf.c: coredump: delete duplicated overflow check
fs/binfmt_elf.c: coredump: allocate core ELF header on stack
fs/binfmt_elf.c: make BAD_ADDR() unlikely
fs/binfmt_elf.c: better codegen around current->mm
fs/binfmt_elf.c: don't copy ELF header around
fs/binfmt_elf.c: fix ->start_code calculation
fs/binfmt_elf.c: smaller code generation around auxv vector fill
lib/find_bit.c: uninline helper _find_next_bit()
lib/find_bit.c: join _find_next_bit{_le}
uapi: rename ext2_swab() to swab() and share globally in swab.h
lib/scatterlist.c: adjust indentation in __sg_alloc_table
...
136 files changed, 2724 insertions, 1293 deletions
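Among the changes in this merge is a new document, Documentation/core-api/pin_user_pages.rst, whose "Tracking dma-pinned pages" section describes overloading page->_refcount by adding GUP_PIN_COUNTING_BIAS (1024) per pin instead of carving out bit fields. As a reading aid, here is a minimal userspace sketch of that arithmetic. It is not kernel code: `struct model_page` and every `model_*` function are hypothetical names invented for illustration, and the "pinned" test is assumed to be a simple comparison against the bias, which is what the document's description of false positives implies.

```c
#include <assert.h>
#include <stdbool.h>

/* Value stated in pin_user_pages.rst: 1024, i.e. 10 bits. */
#define GUP_PIN_COUNTING_BIAS 1024

/* Hypothetical stand-in for struct page; only the refcount is modelled. */
struct model_page {
	int refcount;
};

/* An ordinary reference, as get_page() would take: +1. */
static void model_get_page(struct model_page *p)
{
	p->refcount += 1;
}

/* A DMA pin, as the pin_user_pages*() family would take: +1024. */
static void model_pin_page(struct model_page *p)
{
	p->refcount += GUP_PIN_COUNTING_BIAS;
}

/* Dropping a pin must undo exactly one bias increment. */
static void model_unpin_page(struct model_page *p)
{
	assert(p->refcount >= GUP_PIN_COUNTING_BIAS);
	p->refcount -= GUP_PIN_COUNTING_BIAS;
}

/*
 * Assumed form of the "is this page DMA-pinned?" query. It can report
 * false positives (many plain references look like one pin) but never
 * false negatives, which is the trade-off the document accepts.
 */
static bool model_page_dma_pinned(const struct model_page *p)
{
	return p->refcount >= GUP_PIN_COUNTING_BIAS;
}

/*
 * How many times a whole huge page can be pinned before a 31-bit
 * counter runs out, given that each pin adds the bias once per
 * base-size page.
 */
static unsigned long long model_max_full_pins(unsigned long long huge_bytes,
					      unsigned long long base_bytes)
{
	unsigned long long pages_per_huge = huge_bytes / base_bytes;

	return (1ULL << 31) / (pages_per_huge * GUP_PIN_COUNTING_BIAS);
}
```

For a 1 GB huge page made of 4 KB pages, `model_max_full_pins(1ULL << 30, 4096)` evaluates to 8, matching the "about 8 references" figure in the document's TODO note. A page that starts with one reference and then receives 1023 more plain `model_get_page()` calls also reports as pinned, which is the acceptable false positive the document mentions.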
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index ec92120a7952..ddc5ccdd4cd1 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -834,6 +834,18 @@ dump out devices still on the deferred probe list after retrying. + dfltcc= [HW,S390] + Format: { on | off | def_only | inf_only | always } + on: s390 zlib hardware support for compression on + level 1 and decompression (default) + off: No s390 zlib hardware support + def_only: s390 zlib hardware support for deflate + only (compression on level 1) + inf_only: s390 zlib hardware support for inflate + only (decompression) + always: Same as 'on' but ignores the selected compression + level always using hardware support (used for debugging) + dhash_entries= [KNL] Set number of hash buckets for dentry cache. diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst index bc0c727d7fd8..a501dc1c90d0 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -31,6 +31,7 @@ Core utilities generic-radix-tree memory-allocation mm-api + pin_user_pages gfp_mask-from-fs-io timekeeping boot-time-mm diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst new file mode 100644 index 000000000000..1d490155ecd7 --- /dev/null +++ b/Documentation/core-api/pin_user_pages.rst @@ -0,0 +1,232 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================================================== +pin_user_pages() and related calls +==================================================== + +.. contents:: :local: + +Overview +======== + +This document describes the following functions:: + + pin_user_pages() + pin_user_pages_fast() + pin_user_pages_remote() + +Basic description of FOLL_PIN +============================= + +FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*() +("gup") family of functions. 
FOLL_PIN has significant interactions and +interdependencies with FOLL_LONGTERM, so both are covered here. + +FOLL_PIN is internal to gup, meaning that it should not appear at the gup call +sites. This allows the associated wrapper functions (pin_user_pages*() and +others) to set the correct combination of these flags, and to check for problems +as well. + +FOLL_LONGTERM, on the other hand, *is* allowed to be set at the gup call sites. +This is in order to avoid creating a large number of wrapper functions to cover +all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the +pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, so +that's a natural dividing line, and a good point to make separate wrapper calls. +In other words, use pin_user_pages*() for DMA-pinned pages, and +get_user_pages*() for other cases. There are four cases described later on in +this document, to further clarify that concept. + +FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However, +multiple threads and call sites are free to pin the same struct pages, via both +FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the +other, not the struct page(s). + +The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN +uses a different reference counting technique. + +FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is, +FOLL_LONGTERM is a specific case, more restrictive case of FOLL_PIN. + +Which flags are set by each wrapper +=================================== + +For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup +flags the caller provides. The caller is required to pass in a non-null struct +pages* array, and the function then pin pages by incrementing each by a special +value. For now, that value is +1, just like get_user_pages*().:: + + Function + -------- + pin_user_pages FOLL_PIN is always set internally by this function. 
+ pin_user_pages_fast FOLL_PIN is always set internally by this function. + pin_user_pages_remote FOLL_PIN is always set internally by this function. + +For these get_user_pages*() functions, FOLL_GET might not even be specified. +Behavior is a little more complex than above. If FOLL_GET was *not* specified, +but the caller passed in a non-null struct pages* array, then the function +sets FOLL_GET for you, and proceeds to pin pages by incrementing the refcount +of each page by +1.:: + + Function + -------- + get_user_pages FOLL_GET is sometimes set internally by this function. + get_user_pages_fast FOLL_GET is sometimes set internally by this function. + get_user_pages_remote FOLL_GET is sometimes set internally by this function. + +Tracking dma-pinned pages +========================= + +Some of the key design constraints, and solutions, for tracking dma-pinned +pages: + +* An actual reference count, per struct page, is required. This is because + multiple processes may pin and unpin a page. + +* False positives (reporting that a page is dma-pinned, when in fact it is not) + are acceptable, but false negatives are not. + +* struct page may not be increased in size for this, and all fields are already + used. + +* Given the above, we can overload the page->_refcount field by using, sort of, + the upper bits in that field for a dma-pinned count. "Sort of", means that, + rather than dividing page->_refcount into bit fields, we simple add a medium- + large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024: 10 bits) to + page->_refcount. This provides fuzzy behavior: if a page has get_page() called + on it 1024 times, then it will appear to have a single dma-pinned count. + And again, that's acceptable. + +This also leads to limitations: there are only 31-10==21 bits available for a +counter that increments 10 bits at a time. + +TODO: for 1GB and larger huge pages, this is cutting it close. 
That's because +when pin_user_pages() follows such pages, it increments the head page by "1" +(where "1" used to mean "+1" for get_user_pages(), but now means "+1024" for +pin_user_pages()) for each tail page. So if you have a 1GB huge page: + +* There are 256K (18 bits) worth of 4 KB tail pages. +* There are 21 bits available to count up via GUP_PIN_COUNTING_BIAS (that is, + 10 bits at a time) +* There are 21 - 18 == 3 bits available to count. Except that there aren't, + because you need to allow for a few normal get_page() calls on the head page, + as well. Fortunately, the approach of using addition, rather than "hard" + bitfields, within page->_refcount, allows for sharing these bits gracefully. + But we're still looking at about 8 references. + +This, however, is a missing feature more than anything else, because it's easily +solved by addressing an obvious inefficiency in the original get_user_pages() +approach of retrieving pages: stop treating all the pages as if they were +PAGE_SIZE. Retrieve huge pages as huge pages. The callers need to be aware of +this, so some work is required. Once that's in place, this limitation mostly +disappears from view, because there will be ample refcounting range available. + +* Callers must specifically request "dma-pinned tracking of pages". In other + words, just calling get_user_pages() will not suffice; a new set of functions, + pin_user_page() and related, must be used. + +FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags +========================================================== + +Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for describing +these categories: + +CASE 1: Direct IO (DIO) +----------------------- +There are GUP references to pages that are serving +as DIO buffers. These buffers are needed for a relatively short time (so they +are not "long term"). No special synchronization with page_mkclean() or +munmap() is provided. 
Therefore, flags to set at the call site are: :: + + FOLL_PIN + +...but rather than setting FOLL_PIN directly, call sites should use one of +the pin_user_pages*() routines that set FOLL_PIN. + +CASE 2: RDMA +------------ +There are GUP references to pages that are serving as DMA +buffers. These buffers are needed for a long time ("long term"). No special +synchronization with page_mkclean() or munmap() is provided. Therefore, flags +to set at the call site are: :: + + FOLL_PIN | FOLL_LONGTERM + +NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's +because DAX pages do not have a separate page cache, and so "pinning" implies +locking down file system blocks, which is not (yet) supported in that way. + +CASE 3: Hardware with page faulting support +------------------------------------------- +Here, a well-written driver doesn't normally need to pin pages at all. However, +if the driver does choose to do so, it can register MMU notifiers for the range, +and will be called back upon invalidation. Either way (avoiding page pinning, or +using MMU notifiers to unpin upon request), there is proper synchronization with +both filesystem and mm (page_mkclean(), munmap(), etc). + +Therefore, neither flag needs to be set. + +In this case, ideally, neither get_user_pages() nor pin_user_pages() should be +called. Instead, the software should be written so that it does not pin pages. +This allows mm and filesystems to operate more efficiently and reliably. + +CASE 4: Pinning for struct page manipulation only +------------------------------------------------- +Here, normal GUP calls are sufficient, so neither flag needs to be set. + +page_dma_pinned(): the whole point of pinning +============================================= + +The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able +to query, "is this page DMA-pinned?" 
That allows code such as page_mkclean() +(and file system writeback code in general) to make informed decisions about +what to do when a page cannot be unmapped due to such pins. + +What to do in those cases is the subject of a years-long series of discussions +and debates (see the References at the end of this document). It's a TODO item +here: fill in the details once that's worked out. Meanwhile, it's safe to say +that having this available: :: + + static inline bool page_dma_pinned(struct page *page) + +...is a prerequisite to solving the long-running gup+DMA problem. + +Another way of thinking about FOLL_GET, FOLL_PIN, and FOLL_LONGTERM +=================================================================== + +Another way of thinking about these flags is as a progression of restrictions: +FOLL_GET is for struct page manipulation, without affecting the data that the +struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for +short term pins on pages whose data *will* get accessed. As such, FOLL_PIN is +a "more severe" form of pinning. And finally, FOLL_LONGTERM is an even more +restrictive case that has FOLL_PIN as a prerequisite: this is for pages that +will be pinned longterm, and whose data will be accessed. + +Unit testing +============ +This file:: + + tools/testing/selftests/vm/gup_benchmark.c + +has the following new calls to exercise the new pin*() wrapper functions: + +* PIN_FAST_BENCHMARK (./gup_benchmark -a) +* PIN_BENCHMARK (./gup_benchmark -b) + +You can monitor how many total dma-pinned pages have been acquired and released +since the system was booted, via two new /proc/vmstat entries: :: + + /proc/vmstat/nr_foll_pin_requested + /proc/vmstat/nr_foll_pin_requested + +Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is +because there is a noticeable performance drop in unpin_user_page(), when they +are activated. 
+ +References +========== + +* `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_ +* `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_ +* `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_ + +John Hubbard, October, 2019 diff --git a/Documentation/vm/zswap.rst b/Documentation/vm/zswap.rst index 1444ecd40911..61f6185188cd 100644 --- a/Documentation/vm/zswap.rst +++ b/Documentation/vm/zswap.rst @@ -130,6 +130,19 @@ checking for the same-value filled pages during store operation. However, the existing pages which are marked as same-value filled pages remain stored unchanged in zswap until they are either loaded or invalidated. +To prevent zswap from shrinking pool when zswap is full and there's a high +pressure on swap (this will result in flipping pages in and out zswap pool +without any real benefit but with a performance drop for the system), a +special parameter has been introduced to implement a sort of hysteresis to +refuse taking pages into zswap pool until it has sufficient space if the limit +has been hit. To set the threshold at which zswap would start accepting pages +again after it became full, use the sysfs ``accept_threhsold_percent`` +attribute, e. g.:: + + echo 80 > /sys/module/zswap/parameters/accept_threhsold_percent + +Setting this parameter to 100 will disable the hysteresis. + A debugfs interface is provided for various statistic about pool size, number of pages stored, same-value filled pages and various counters for the reasons pages are rejected. 
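The zswap.rst hunk above describes a hysteresis: once the pool has hit its limit, new pages are refused until usage falls back to a threshold percentage of that limit. The following is a minimal userspace sketch of that rule, assuming the threshold is a percentage of the pool's maximum size. `struct model_pool` and `model_pool_can_accept()` are hypothetical names for illustration only and are not claimed to match the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a compressed pool; not the kernel's structure. */
struct model_pool {
	unsigned long stored_pages;            /* pages currently held */
	unsigned long max_pages;               /* pool size limit */
	unsigned int accept_threshold_percent; /* 0..100, as in the sysfs knob */
	bool reached_full;                     /* set once the limit was hit */
};

/*
 * Decide whether the pool may take another page.
 *
 * At the limit: refuse and remember that the limit was hit.
 * After having been full: keep refusing until usage has dropped to
 * max_pages * accept_threshold_percent / 100 or below.
 * With a threshold of 100 the second condition can never hold once
 * usage is below the limit, so the hysteresis is effectively disabled.
 */
static bool model_pool_can_accept(struct model_pool *pool)
{
	assert(pool->accept_threshold_percent <= 100);

	if (pool->stored_pages >= pool->max_pages) {
		pool->reached_full = true;
		return false;
	}

	if (pool->reached_full) {
		unsigned long resume_at =
			pool->max_pages * pool->accept_threshold_percent / 100;

		if (pool->stored_pages > resume_at)
			return false;
		pool->reached_full = false;
	}

	return true;
}
```

With a limit of 100 pages and a threshold of 80, a pool that filled up keeps refusing at 90 stored pages and resumes accepting at 80. Keeping the "was full" state in a flag, rather than comparing against the threshold on every call, is what stops the pool from refusing pages before it has ever been full.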
diff --git a/arch/powerpc/mm/book3s64/iommu_api.c b/arch/powerpc/mm/book3s64/iommu_api.c index 56cc84520577..eba73ebd8ae5 100644 --- a/arch/powerpc/mm/book3s64/iommu_api.c +++ b/arch/powerpc/mm/book3s64/iommu_api.c @@ -103,7 +103,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua, for (entry = 0; entry < entries; entry += chunk) { unsigned long n = min(entries - entry, chunk); - ret = get_user_pages(ua + (entry << PAGE_SHIFT), n, + ret = pin_user_pages(ua + (entry << PAGE_SHIFT), n, FOLL_WRITE | FOLL_LONGTERM, mem->hpages + entry, NULL); if (ret == n) { @@ -167,9 +167,8 @@ good_exit: return 0; free_exit: - /* free the reference taken */ - for (i = 0; i < pinned; i++) - put_page(mem->hpages[i]); + /* free the references taken */ + unpin_user_pages(mem->hpages, pinned); vfree(mem->hpas); kfree(mem); @@ -215,7 +214,8 @@ static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem) if (mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY) SetPageDirty(page); - put_page(page); + unpin_user_page(page); + mem->hpas[i] = 0; } } diff --git a/arch/s390/boot/compressed/decompressor.c b/arch/s390/boot/compressed/decompressor.c index 45046630c56a..368fd372c875 100644 --- a/arch/s390/boot/compressed/decompressor.c +++ b/arch/s390/boot/compressed/decompressor.c @@ -30,13 +30,13 @@ extern unsigned char _compressed_start[]; extern unsigned char _compressed_end[]; #ifdef CONFIG_HAVE_KERNEL_BZIP2 -#define HEAP_SIZE 0x400000 +#define BOOT_HEAP_SIZE 0x400000 #else -#define HEAP_SIZE 0x10000 +#define BOOT_HEAP_SIZE 0x10000 #endif static unsigned long free_mem_ptr = (unsigned long) _end; -static unsigned long free_mem_end_ptr = (unsigned long) _end + HEAP_SIZE; +static unsigned long free_mem_end_ptr = (unsigned long) _end + BOOT_HEAP_SIZE; #ifdef CONFIG_KERNEL_GZIP #include "../../../../lib/decompress_inflate.c" @@ -62,7 +62,7 @@ static unsigned long free_mem_end_ptr = (unsigned long) _end + HEAP_SIZE; #include "../../../../lib/decompress_unxz.c" #endif -#define 
decompress_offset ALIGN((unsigned long)_end + HEAP_SIZE, PAGE_SIZE) +#define decompress_offset ALIGN((unsigned long)_end + BOOT_HEAP_SIZE, PAGE_SIZE) unsigned long mem_safe_offset(void) { diff --git a/arch/s390/boot/ipl_parm.c b/arch/s390/boot/ipl_parm.c index 24ef67eb1cef..357adad991d2 100644 --- a/arch/s390/boot/ipl_parm.c +++ b/arch/s390/boot/ipl_parm.c @@ -14,6 +14,7 @@ char __bootdata(early_command_line)[COMMAND_LINE_SIZE]; struct ipl_parameter_block __bootdata_preserved(ipl_block); int __bootdata_preserved(ipl_block_valid); +unsigned int __bootdata_preserved(zlib_dfltcc_support) = ZLIB_DFLTCC_FULL; unsigned long __bootdata(vmalloc_size) = VMALLOC_DEFAULT_SIZE; unsigned long __bootdata(memory_end); @@ -229,6 +230,19 @@ void parse_boot_command_line(void) if (!strcmp(param, "vmalloc") && val) vmalloc_size = round_up(memparse(val, NULL), PAGE_SIZE); + if (!strcmp(param, "dfltcc")) { + if (!strcmp(val, "off")) + zlib_dfltcc_support = ZLIB_DFLTCC_DISABLED; + else if (!strcmp(val, "on")) + zlib_dfltcc_support = ZLIB_DFLTCC_FULL; + else if (!strcmp(val, "def_only")) + zlib_dfltcc_support = ZLIB_DFLTCC_DEFLATE_ONLY; + else if (!strcmp(val, "inf_only")) + zlib_dfltcc_support = ZLIB_DFLTCC_INFLATE_ONLY; + else if (!strcmp(val, "always")) + zlib_dfltcc_support = ZLIB_DFLTCC_FULL_DEBUG; + } + if (!strcmp(param, "noexec")) { rc = kstrtobool(val, &enabled); if (!rc && !enabled) diff --git a/arch/s390/include/asm/setup.h b/arch/s390/include/asm/setup.h index 69289e99cabd..b241ddb67caf 100644 --- a/arch/s390/include/asm/setup.h +++ b/arch/s390/include/asm/setup.h @@ -79,6 +79,13 @@ struct parmarea { char command_line[ARCH_COMMAND_LINE_SIZE]; /* 0x10480 */ }; +extern unsigned int zlib_dfltcc_support; +#define ZLIB_DFLTCC_DISABLED 0 +#define ZLIB_DFLTCC_FULL 1 +#define ZLIB_DFLTCC_DEFLATE_ONLY 2 +#define ZLIB_DFLTCC_INFLATE_ONLY 3 +#define ZLIB_DFLTCC_FULL_DEBUG 4 + extern int noexec_disabled; extern int memory_end_set; extern unsigned long memory_end; diff --git 
a/arch/s390/kernel/setup.c b/arch/s390/kernel/setup.c index 87a467dff5eb..b2c2f75860e8 100644 --- a/arch/s390/kernel/setup.c +++ b/arch/s390/kernel/setup.c @@ -111,6 +111,8 @@ unsigned long __bootdata_preserved(__etext_dma); unsigned long __bootdata_preserved(__sdma); unsigned long __bootdata_preserved(__edma); unsigned long __bootdata_preserved(__kaslr_offset); +unsigned int __bootdata_preserved(zlib_dfltcc_support); +EXPORT_SYMBOL(zlib_dfltcc_support); unsigned long VMALLOC_START; EXPORT_SYMBOL(VMALLOC_START); @@ -759,14 +761,6 @@ static void __init free_mem_detect_info(void) memblock_free(start, size); } -static void __init memblock_physmem_add(phys_addr_t start, phys_addr_t size) -{ - memblock_dbg("memblock_physmem_add: [%#016llx-%#016llx]\n", - start, start + size - 1); - memblock_add_range(&memblock.memory, start, size, 0, 0); - memblock_add_range(&memblock.physmem, start, size, 0, 0); -} - static const char * __init get_mem_info_source(void) { switch (mem_detect.info_source) { @@ -791,8 +785,10 @@ static void __init memblock_add_mem_detect_info(void) get_mem_info_source(), mem_detect.info_source); /* keep memblock lists close to the kernel */ memblock_set_bottom_up(true); - for_each_mem_detect_block(i, &start, &end) + for_each_mem_detect_block(i, &start, &end) { + memblock_add(start, end - start); memblock_physmem_add(start, end - start); + } memblock_set_bottom_up(false); memblock_dump_all(); } diff --git a/drivers/acpi/thermal.c b/drivers/acpi/thermal.c index d831a61e0010..19067a5e5293 100644 --- a/drivers/acpi/thermal.c +++ b/drivers/acpi/thermal.c @@ -27,6 +27,7 @@ #include <linux/acpi.h> #include <linux/workqueue.h> #include <linux/uaccess.h> +#include <linux/units.h> #define PREFIX "ACPI: " @@ -172,7 +173,7 @@ struct acpi_thermal { struct acpi_handle_list devices; struct thermal_zone_device *thermal_zone; int tz_enabled; - int kelvin_offset; + int kelvin_offset; /* in millidegrees */ struct work_struct thermal_check_work; }; @@ -297,7 +298,8 @@ static 
int acpi_thermal_trips_update(struct acpi_thermal *tz, int flag) if (crt == -1) { tz->trips.critical.flags.valid = 0; } else if (crt > 0) { - unsigned long crt_k = CELSIUS_TO_DECI_KELVIN(crt); + unsigned long crt_k = celsius_to_deci_kelvin(crt); + /* * Allow override critical threshold */ @@ -333,7 +335,7 @@ static int acpi_thermal_trips_update(struct acpi_thermal *tz, int flag) if (psv == -1) { status = AE_SUPPORT; } else if (psv > 0) { - tmp = CELSIUS_TO_DECI_KELVIN(psv); + tmp = celsius_to_deci_kelvin(psv); status = AE_OK; } else { status = acpi_evaluate_integer(tz->device->handle, @@ -413,7 +415,7 @@ static int acpi_thermal_trips_update(struct acpi_thermal *tz, int flag) break; if (i == 1) tz->trips.active[0].temperature = - CELSIUS_TO_DECI_KELVIN(act); + celsius_to_deci_kelvin(act); else /* * Don't allow override higher than @@ -421,9 +423,9 @@ static int acpi_thermal_trips_update(struct acpi_thermal *tz, int flag) */ tz->trips.active[i - 1].temperature = (tz->trips.active[i - 2].temperature < - CELSIUS_TO_DECI_KELVIN(act) ? + celsius_to_deci_kelvin(act) ? 
tz->trips.active[i - 2].temperature : - CELSIUS_TO_DECI_KELVIN(act)); + celsius_to_deci_kelvin(act)); break; } else { tz->trips.active[i].temperature = tmp; @@ -519,7 +521,7 @@ static int thermal_get_temp(struct thermal_zone_device *thermal, int *temp) if (result) return result; - *temp = DECI_KELVIN_TO_MILLICELSIUS_WITH_OFFSET(tz->temperature, + *temp = deci_kelvin_to_millicelsius_with_offset(tz->temperature, tz->kelvin_offset); return 0; } @@ -624,7 +626,7 @@ static int thermal_get_trip_temp(struct thermal_zone_device *thermal, if (tz->trips.critical.flags.valid) { if (!trip) { - *temp = DECI_KELVIN_TO_MILLICELSIUS_WITH_OFFSET( + *temp = deci_kelvin_to_millicelsius_with_offset( tz->trips.critical.temperature, tz->kelvin_offset); return 0; @@ -634,7 +636,7 @@ static int thermal_get_trip_temp(struct thermal_zone_device *thermal, if (tz->trips.hot.flags.valid) { if (!trip) { - *temp = DECI_KELVIN_TO_MILLICELSIUS_WITH_OFFSET( + *temp = deci_kelvin_to_millicelsius_with_offset( tz->trips.hot.temperature, tz->kelvin_offset); return 0; @@ -644,7 +646,7 @@ static int thermal_get_trip_temp(struct thermal_zone_device *thermal, if (tz->trips.passive.flags.valid) { if (!trip) { - *temp = DECI_KELVIN_TO_MILLICELSIUS_WITH_OFFSET( + *temp = deci_kelvin_to_millicelsius_with_offset( tz->trips.passive.temperature, tz->kelvin_offset); return 0; @@ -655,7 +657,7 @@ static int thermal_get_trip_temp(struct thermal_zone_device *thermal, for (i = 0; i < ACPI_THERMAL_MAX_ACTIVE && tz->trips.active[i].flags.valid; i++) { if (!trip) { - *temp = DECI_KELVIN_TO_MILLICELSIUS_WITH_OFFSET( + *temp = deci_kelvin_to_millicelsius_with_offset( tz->trips.active[i].temperature, tz->kelvin_offset); return 0; @@ -672,7 +674,7 @@ static int thermal_get_crit_temp(struct thermal_zone_device *thermal, struct acpi_thermal *tz = thermal->devdata; if (tz->trips.critical.flags.valid) { - *temperature = DECI_KELVIN_TO_MILLICELSIUS_WITH_OFFSET( + *temperature = deci_kelvin_to_millicelsius_with_offset( 
tz->trips.critical.temperature, tz->kelvin_offset); return 0; @@ -692,7 +694,7 @@ static int thermal_get_trend(struct thermal_zone_device *thermal, if (type == THERMAL_TRIP_ACTIVE) { int trip_temp; - int temp = DECI_KELVIN_TO_MILLICELSIUS_WITH_OFFSET( + int temp = deci_kelvin_to_millicelsius_with_offset( tz->temperature, tz->kelvin_offset); if (thermal_get_trip_temp(thermal, trip, &trip_temp)) return -EINVAL; @@ -1043,9 +1045,9 @@ static void acpi_thermal_guess_offset(struct acpi_thermal *tz) { if (tz->trips.critical.flags.valid && (tz->trips.critical.temperature % 5) == 1) - tz->kelvin_offset = 2731; + tz->kelvin_offset = 273100; else - tz->kelvin_offset = 2732; + tz->kelvin_offset = 273200; } static void acpi_thermal_check_fn(struct work_struct *work) @@ -1087,7 +1089,7 @@ static int acpi_thermal_add(struct acpi_device *device) INIT_WORK(&tz->thermal_check_work, acpi_thermal_check_fn); pr_info(PREFIX "%s [%s] (%ld C)\n", acpi_device_name(device), - acpi_device_bid(device), DECI_KELVIN_TO_CELSIUS(tz->temperature)); + acpi_device_bid(device), deci_kelvin_to_celsius(tz->temperature)); goto end; free_memory: diff --git a/drivers/base/memory.c b/drivers/base/memory.c index 799b43191dea..15659306ad69 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -70,20 +70,6 @@ void unregister_memory_notifier(struct notifier_block *nb) } EXPORT_SYMBOL(unregister_memory_notifier); -static ATOMIC_NOTIFIER_HEAD(memory_isolate_chain); - -int register_memory_isolate_notifier(struct notifier_block *nb) -{ - return atomic_notifier_chain_register(&memory_isolate_chain, nb); -} -EXPORT_SYMBOL(register_memory_isolate_notifier); - -void unregister_memory_isolate_notifier(struct notifier_block *nb) -{ - atomic_notifier_chain_unregister(&memory_isolate_chain, nb); -} -EXPORT_SYMBOL(unregister_memory_isolate_notifier); - static void memory_block_release(struct device *dev) { struct memory_block *mem = to_memory_block(dev); @@ -175,11 +161,6 @@ int memory_notify(unsigned long val, 
void *v) return blocking_notifier_call_chain(&memory_chain, val, v); } -int memory_isolate_notify(unsigned long val, void *v) -{ - return atomic_notifier_call_chain(&memory_isolate_chain, val, v); -} - /* * The probe routines leave the pages uninitialized, just as the bootmem code * does. Make sure we do not access them, but instead use only information from @@ -225,7 +206,7 @@ static bool pages_correctly_probed(unsigned long start_pfn) */ static int memory_block_action(unsigned long start_section_nr, unsigned long action, - int online_type) + int online_type, int nid) { unsigned long start_pfn; unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; @@ -238,7 +219,7 @@ memory_block_action(unsigned long start_section_nr, unsigned long action, if (!pages_correctly_probed(start_pfn)) return -EBUSY; - ret = online_pages(start_pfn, nr_pages, online_type); + ret = online_pages(start_pfn, nr_pages, online_type, nid); break; case MEM_OFFLINE: ret = offline_pages(start_pfn, nr_pages); @@ -264,7 +245,7 @@ static int memory_block_change_state(struct memory_block *mem, mem->state = MEM_GOING_OFFLINE; ret = memory_block_action(mem->start_section_nr, to_state, - mem->online_type); + mem->online_type, mem->nid); mem->state = ret ? 
from_state_req : to_state; diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index 4285e75e52c3..1bdb5793842b 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -207,14 +207,17 @@ static inline void zram_fill_page(void *ptr, unsigned long len, static bool page_same_filled(void *ptr, unsigned long *element) { - unsigned int pos; unsigned long *page; unsigned long val; + unsigned int pos, last_pos = PAGE_SIZE / sizeof(*page) - 1; page = (unsigned long *)ptr; val = page[0]; - for (pos = 1; pos < PAGE_SIZE / sizeof(*page); pos++) { + if (val != page[last_pos]) + return false; + + for (pos = 1; pos < last_pos; pos++) { if (val != page[pos]) return false; } @@ -626,7 +629,7 @@ static ssize_t writeback_store(struct device *dev, struct bio bio; struct bio_vec bio_vec; struct page *page; - ssize_t ret; + ssize_t ret = len; int mode; unsigned long blk_idx = 0; @@ -762,7 +765,6 @@ next: if (blk_idx) free_block_bdev(zram, blk_idx); - ret = len; __free_page(page); release_init_lock: up_read(&zram->init_lock); diff --git a/drivers/gpu/drm/via/via_dmablit.c b/drivers/gpu/drm/via/via_dmablit.c index d13a3897506e..551fa31629af 100644 --- a/drivers/gpu/drm/via/via_dmablit.c +++ b/drivers/gpu/drm/via/via_dmablit.c @@ -188,8 +188,8 @@ via_free_sg_info(struct pci_dev *pdev, drm_via_sg_info_t *vsg) kfree(vsg->desc_pages); /* fall through */ case dr_via_pages_locked: - put_user_pages_dirty_lock(vsg->pages, vsg->num_pages, - (vsg->direction == DMA_FROM_DEVICE)); + unpin_user_pages_dirty_lock(vsg->pages, vsg->num_pages, + (vsg->direction == DMA_FROM_DEVICE)); /* fall through */ case dr_via_pages_alloc: vfree(vsg->pages); @@ -239,7 +239,7 @@ via_lock_all_dma_pages(drm_via_sg_info_t *vsg, drm_via_dmablit_t *xfer) vsg->pages = vzalloc(array_size(sizeof(struct page *), vsg->num_pages)); if (NULL == vsg->pages) return -ENOMEM; - ret = get_user_pages_fast((unsigned long)xfer->mem_addr, + ret = pin_user_pages_fast((unsigned 
long)xfer->mem_addr, vsg->num_pages, vsg->direction == DMA_FROM_DEVICE ? FOLL_WRITE : 0, vsg->pages); diff --git a/drivers/iio/adc/qcom-vadc-common.c b/drivers/iio/adc/qcom-vadc-common.c index dcd7fb5b9fb2..2bb78d1c4daa 100644 --- a/drivers/iio/adc/qcom-vadc-common.c +++ b/drivers/iio/adc/qcom-vadc-common.c @@ -6,6 +6,7 @@ #include <linux/log2.h> #include <linux/err.h> #include <linux/module.h> +#include <linux/units.h> #include "qcom-vadc-common.h" @@ -236,8 +237,7 @@ static int qcom_vadc_scale_die_temp(const struct vadc_linear_graph *calib_graph, voltage = 0; } - voltage -= KELVINMIL_CELSIUSMIL; - *result_mdec = voltage; + *result_mdec = milli_kelvin_to_millicelsius(voltage); return 0; } @@ -325,7 +325,7 @@ static int qcom_vadc_scale_hw_calib_die_temp( { *result_mdec = qcom_vadc_scale_code_voltage_factor(adc_code, prescale, data, 2); - *result_mdec -= KELVINMIL_CELSIUSMIL; + *result_mdec = milli_kelvin_to_millicelsius(*result_mdec); return 0; } diff --git a/drivers/iio/adc/qcom-vadc-common.h b/drivers/iio/adc/qcom-vadc-common.h index bbb1fa02b382..e074902a24cc 100644 --- a/drivers/iio/adc/qcom-vadc-common.h +++ b/drivers/iio/adc/qcom-vadc-common.h @@ -38,7 +38,6 @@ #define VADC_AVG_SAMPLES_MAX 512 #define ADC5_AVG_SAMPLES_MAX 16 -#define KELVINMIL_CELSIUSMIL 273150 #define PMIC5_CHG_TEMP_SCALE_FACTOR 377500 #define PMIC5_SMB_TEMP_CONSTANT 419400 #define PMIC5_SMB_TEMP_SCALE_FACTOR 356 diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index 146f98fbf22b..c3769a5f096d 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -54,7 +54,7 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) { page = sg_page_iter_page(&sg_iter); - put_user_pages_dirty_lock(&page, 1, umem->writable && dirty); + unpin_user_pages_dirty_lock(&page, 1, umem->writable && dirty); } sg_free_table(&umem->sg_head); @@ -257,16 +257,13 @@ struct 
ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr, sg = umem->sg_head.sgl; while (npages) { - down_read(&mm->mmap_sem); - ret = get_user_pages(cur_base, - min_t(unsigned long, npages, - PAGE_SIZE / sizeof (struct page *)), - gup_flags | FOLL_LONGTERM, - page_list, NULL); - if (ret < 0) { - up_read(&mm->mmap_sem); + ret = pin_user_pages_fast(cur_base, + min_t(unsigned long, npages, + PAGE_SIZE / + sizeof(struct page *)), + gup_flags | FOLL_LONGTERM, page_list); + if (ret < 0) goto umem_release; - } cur_base += ret * PAGE_SIZE; npages -= ret; @@ -274,8 +271,6 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr, sg = ib_umem_add_sg_table(sg, page_list, ret, dma_get_max_seg_size(device->dma_device), &umem->sg_nents); - - up_read(&mm->mmap_sem); } sg_mark_end(sg); diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c index dac3fd2ebc26..a71ce0ae2031 100644 --- a/drivers/infiniband/core/umem_odp.c +++ b/drivers/infiniband/core/umem_odp.c @@ -293,9 +293,8 @@ EXPORT_SYMBOL(ib_umem_odp_release); * The function returns -EFAULT if the DMA mapping operation fails. It returns * -EAGAIN if a concurrent invalidation prevents us from updating the page. * - * The page is released via put_user_page even if the operation failed. For - * on-demand pinning, the page is released whenever it isn't stored in the - * umem. + * The page is released via put_page even if the operation failed. For on-demand + * pinning, the page is released whenever it isn't stored in the umem. 
*/ static int ib_umem_odp_map_dma_single_page( struct ib_umem_odp *umem_odp, @@ -348,7 +347,7 @@ static int ib_umem_odp_map_dma_single_page( } out: - put_user_page(page); + put_page(page); return ret; } @@ -458,7 +457,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 user_virt, ret = -EFAULT; break; } - put_user_page(local_page_list[j]); + put_page(local_page_list[j]); continue; } @@ -485,8 +484,8 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 user_virt, * ib_umem_odp_map_dma_single_page(). */ if (npages - (j + 1) > 0) - put_user_pages(&local_page_list[j+1], - npages - (j + 1)); + release_pages(&local_page_list[j+1], + npages - (j + 1)); break; } } diff --git a/drivers/infiniband/hw/hfi1/user_pages.c b/drivers/infiniband/hw/hfi1/user_pages.c index 469acb961fbd..3b505006c0a6 100644 --- a/drivers/infiniband/hw/hfi1/user_pages.c +++ b/drivers/infiniband/hw/hfi1/user_pages.c @@ -106,7 +106,7 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned long vaddr, size_t np int ret; unsigned int gup_flags = FOLL_LONGTERM | (writable ? 
FOLL_WRITE : 0); - ret = get_user_pages_fast(vaddr, npages, gup_flags, pages); + ret = pin_user_pages_fast(vaddr, npages, gup_flags, pages); if (ret < 0) return ret; @@ -118,7 +118,7 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned long vaddr, size_t np void hfi1_release_user_pages(struct mm_struct *mm, struct page **p, size_t npages, bool dirty) { - put_user_pages_dirty_lock(p, npages, dirty); + unpin_user_pages_dirty_lock(p, npages, dirty); if (mm) { /* during close after signal, mm can be NULL */ atomic64_sub(npages, &mm->pinned_vm); diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c b/drivers/infiniband/hw/mthca/mthca_memfree.c index edccfd6e178f..78a48aea3faf 100644 --- a/drivers/infiniband/hw/mthca/mthca_memfree.c +++ b/drivers/infiniband/hw/mthca/mthca_memfree.c @@ -472,7 +472,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct mthca_uar *uar, goto out; } - ret = get_user_pages_fast(uaddr & PAGE_MASK, 1, + ret = pin_user_pages_fast(uaddr & PAGE_MASK, 1, FOLL_WRITE | FOLL_LONGTERM, pages); if (ret < 0) goto out; @@ -482,7 +482,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct mthca_uar *uar, ret = pci_map_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE); if (ret < 0) { - put_user_page(pages[0]); + unpin_user_page(pages[0]); goto out; } @@ -490,7 +490,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct mthca_uar *uar, mthca_uarc_virt(dev, uar, i)); if (ret) { pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE); - put_user_page(sg_page(&db_tab->page[i].mem)); + unpin_user_page(sg_page(&db_tab->page[i].mem)); goto out; } @@ -556,7 +556,7 @@ void mthca_cleanup_user_db_tab(struct mthca_dev *dev, struct mthca_uar *uar, if (db_tab->page[i].uvirt) { mthca_UNMAP_ICM(dev, mthca_uarc_virt(dev, uar, i), 1); pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE); - put_user_page(sg_page(&db_tab->page[i].mem)); + unpin_user_page(sg_page(&db_tab->page[i].mem)); } } diff --git 
a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c index 6bf764e41891..342e3172ca40 100644 --- a/drivers/infiniband/hw/qib/qib_user_pages.c +++ b/drivers/infiniband/hw/qib/qib_user_pages.c @@ -40,7 +40,7 @@ static void __qib_release_user_pages(struct page **p, size_t num_pages, int dirty) { - put_user_pages_dirty_lock(p, num_pages, dirty); + unpin_user_pages_dirty_lock(p, num_pages, dirty); } /** @@ -108,7 +108,7 @@ int qib_get_user_pages(unsigned long start_page, size_t num_pages, down_read(&current->mm->mmap_sem); for (got = 0; got < num_pages; got += ret) { - ret = get_user_pages(start_page + got * PAGE_SIZE, + ret = pin_user_pages(start_page + got * PAGE_SIZE, num_pages - got, FOLL_LONGTERM | FOLL_WRITE | FOLL_FORCE, p + got, NULL); diff --git a/drivers/infiniband/hw/qib/qib_user_sdma.c b/drivers/infiniband/hw/qib/qib_user_sdma.c index 05190edc2611..a67599b5a550 100644 --- a/drivers/infiniband/hw/qib/qib_user_sdma.c +++ b/drivers/infiniband/hw/qib/qib_user_sdma.c @@ -317,7 +317,7 @@ static int qib_user_sdma_page_to_frags(const struct qib_devdata *dd, * the caller can ignore this page. 
*/ if (put) { - put_user_page(page); + unpin_user_page(page); } else { /* coalesce case */ kunmap(page); @@ -631,7 +631,7 @@ static void qib_user_sdma_free_pkt_frag(struct device *dev, kunmap(pkt->addr[i].page); if (pkt->addr[i].put_page) - put_user_page(pkt->addr[i].page); + unpin_user_page(pkt->addr[i].page); else __free_page(pkt->addr[i].page); } else if (pkt->addr[i].kvaddr) { @@ -670,7 +670,7 @@ static int qib_user_sdma_pin_pages(const struct qib_devdata *dd, else j = npages; - ret = get_user_pages_fast(addr, j, FOLL_LONGTERM, pages); + ret = pin_user_pages_fast(addr, j, FOLL_LONGTERM, pages); if (ret != j) { i = 0; j = ret; @@ -706,7 +706,7 @@ static int qib_user_sdma_pin_pages(const struct qib_devdata *dd, /* if error, return all pages not managed by pkt */ free_pages: while (i < j) - put_user_page(pages[i++]); + unpin_user_page(pages[i++]); done: return ret; diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c index 62e6ffa9ad78..bd9f944b68fc 100644 --- a/drivers/infiniband/hw/usnic/usnic_uiom.c +++ b/drivers/infiniband/hw/usnic/usnic_uiom.c @@ -75,7 +75,7 @@ static void usnic_uiom_put_pages(struct list_head *chunk_list, int dirty) for_each_sg(chunk->page_list, sg, chunk->nents, i) { page = sg_page(sg); pa = sg_phys(sg); - put_user_pages_dirty_lock(&page, 1, dirty); + unpin_user_pages_dirty_lock(&page, 1, dirty); usnic_dbg("pa: %pa\n", &pa); } kfree(chunk); @@ -141,7 +141,7 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable, ret = 0; while (npages) { - ret = get_user_pages(cur_base, + ret = pin_user_pages(cur_base, min_t(unsigned long, npages, PAGE_SIZE / sizeof(struct page *)), gup_flags | FOLL_LONGTERM, diff --git a/drivers/infiniband/sw/siw/siw_mem.c b/drivers/infiniband/sw/siw/siw_mem.c index e99983f07663..e2061dc0b043 100644 --- a/drivers/infiniband/sw/siw/siw_mem.c +++ b/drivers/infiniband/sw/siw/siw_mem.c @@ -63,7 +63,7 @@ struct siw_mem *siw_mem_id2obj(struct siw_device *sdev, 
int stag_index) static void siw_free_plist(struct siw_page_chunk *chunk, int num_pages, bool dirty) { - put_user_pages_dirty_lock(chunk->plist, num_pages, dirty); + unpin_user_pages_dirty_lock(chunk->plist, num_pages, dirty); } void siw_umem_release(struct siw_umem *umem, bool dirty) @@ -426,7 +426,7 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable) while (nents) { struct page **plist = &umem->page_chunk[i].plist[got]; - rv = get_user_pages(first_page_va, nents, + rv = pin_user_pages(first_page_va, nents, foll_flags | FOLL_LONGTERM, plist, NULL); if (rv < 0) diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c index 66a6c6c236a7..13b65ed9e74c 100644 --- a/drivers/media/v4l2-core/videobuf-dma-sg.c +++ b/drivers/media/v4l2-core/videobuf-dma-sg.c @@ -183,12 +183,12 @@ static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma, dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n", data, size, dma->nr_pages); - err = get_user_pages(data & PAGE_MASK, dma->nr_pages, + err = pin_user_pages(data & PAGE_MASK, dma->nr_pages, flags | FOLL_LONGTERM, dma->pages, NULL); if (err != dma->nr_pages) { dma->nr_pages = (err >= 0) ? err : 0; - dprintk(1, "get_user_pages: err=%d [%d]\n", err, + dprintk(1, "pin_user_pages: err=%d [%d]\n", err, dma->nr_pages); return err < 0 ? 
err : -EINVAL; } @@ -349,8 +349,8 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma) BUG_ON(dma->sglen); if (dma->pages) { - for (i = 0; i < dma->nr_pages; i++) - put_page(dma->pages[i]); + unpin_user_pages_dirty_lock(dma->pages, dma->nr_pages, + dma->direction == DMA_FROM_DEVICE); kfree(dma->pages); dma->pages = NULL; } diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_init.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_init.h index 066765fbef06..0a59a09ef82f 100644 --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_init.h +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_init.h @@ -296,7 +296,6 @@ static inline void bnx2x_dcb_config_qm(struct bnx2x *bp, enum cos_mode mode, * possible, the driver should only write the valid vnics into the internal * ram according to the appropriate port mode. */ -#define BITS_TO_BYTES(x) ((x)/8) /* CMNG constants, as derived from system spec calculations */ diff --git a/drivers/net/wireless/intel/iwlegacy/4965-mac.c b/drivers/net/wireless/intel/iwlegacy/4965-mac.c index d1e17589dbeb..da6d4202611c 100644 --- a/drivers/net/wireless/intel/iwlegacy/4965-mac.c +++ b/drivers/net/wireless/intel/iwlegacy/4965-mac.c @@ -27,6 +27,7 @@ #include <linux/firmware.h> #include <linux/etherdevice.h> #include <linux/if_arp.h> +#include <linux/units.h> #include <net/mac80211.h> @@ -6468,7 +6469,7 @@ il4965_set_hw_params(struct il_priv *il) il->hw_params.valid_rx_ant = il->cfg->valid_rx_ant; il->hw_params.ct_kill_threshold = - CELSIUS_TO_KELVIN(CT_KILL_THRESHOLD_LEGACY); + celsius_to_kelvin(CT_KILL_THRESHOLD_LEGACY); il->hw_params.sens = &il4965_sensitivity; il->hw_params.beacon_time_tsf_bits = IL4965_EXT_BEACON_TIME_POS; diff --git a/drivers/net/wireless/intel/iwlegacy/4965.c b/drivers/net/wireless/intel/iwlegacy/4965.c index 32699b6a68c2..34d0579132ce 100644 --- a/drivers/net/wireless/intel/iwlegacy/4965.c +++ b/drivers/net/wireless/intel/iwlegacy/4965.c @@ -17,6 +17,7 @@ #include <linux/sched.h> #include <linux/skbuff.h> #include 
<linux/netdevice.h> +#include <linux/units.h> #include <net/mac80211.h> #include <linux/etherdevice.h> #include <asm/unaligned.h> @@ -1104,7 +1105,7 @@ il4965_fill_txpower_tbl(struct il_priv *il, u8 band, u16 channel, u8 is_ht40, /* get current temperature (Celsius) */ current_temp = max(il->temperature, IL_TX_POWER_TEMPERATURE_MIN); current_temp = min(il->temperature, IL_TX_POWER_TEMPERATURE_MAX); - current_temp = KELVIN_TO_CELSIUS(current_temp); + current_temp = kelvin_to_celsius(current_temp); /* select thermal txpower adjustment params, based on channel group * (same frequency group used for mimo txatten adjustment) */ @@ -1610,8 +1611,8 @@ il4965_hw_get_temperature(struct il_priv *il) temperature = (temperature * 97) / 100 + TEMPERATURE_CALIB_KELVIN_OFFSET; - D_TEMP("Calibrated temperature: %dK, %dC\n", temperature, - KELVIN_TO_CELSIUS(temperature)); + D_TEMP("Calibrated temperature: %dK, %ldC\n", temperature, + kelvin_to_celsius(temperature)); return temperature; } @@ -1670,12 +1671,12 @@ il4965_temperature_calib(struct il_priv *il) if (il->temperature != temp) { if (il->temperature) - D_TEMP("Temperature changed " "from %dC to %dC\n", - KELVIN_TO_CELSIUS(il->temperature), - KELVIN_TO_CELSIUS(temp)); + D_TEMP("Temperature changed " "from %ldC to %ldC\n", + kelvin_to_celsius(il->temperature), + kelvin_to_celsius(temp)); else - D_TEMP("Temperature " "initialized to %dC\n", - KELVIN_TO_CELSIUS(temp)); + D_TEMP("Temperature " "initialized to %ldC\n", + kelvin_to_celsius(temp)); } il->temperature = temp; diff --git a/drivers/net/wireless/intel/iwlegacy/common.h b/drivers/net/wireless/intel/iwlegacy/common.h index e7fb8e6bb9e7..bc9cd7e5ccb8 100644 --- a/drivers/net/wireless/intel/iwlegacy/common.h +++ b/drivers/net/wireless/intel/iwlegacy/common.h @@ -779,9 +779,6 @@ struct il_sensitivity_ranges { u16 nrg_th_cca; }; -#define KELVIN_TO_CELSIUS(x) ((x)-273) -#define CELSIUS_TO_KELVIN(x) ((x)+273) - /** * struct il_hw_params * @bcast_id: f/w broadcast station ID diff 
--git a/drivers/net/wireless/intel/iwlwifi/dvm/dev.h b/drivers/net/wireless/intel/iwlwifi/dvm/dev.h index be5ef4c3e9d0..8d8380026180 100644 --- a/drivers/net/wireless/intel/iwlwifi/dvm/dev.h +++ b/drivers/net/wireless/intel/iwlwifi/dvm/dev.h @@ -237,11 +237,6 @@ struct iwl_sensitivity_ranges { u16 nrg_th_cca; }; - -#define KELVIN_TO_CELSIUS(x) ((x)-273) -#define CELSIUS_TO_KELVIN(x) ((x)+273) - - /****************************************************************************** * * Functions implemented in core module which are forward declared here diff --git a/drivers/net/wireless/intel/iwlwifi/dvm/devices.c b/drivers/net/wireless/intel/iwlwifi/dvm/devices.c index dc3f197f94d9..d42bc46fe566 100644 --- a/drivers/net/wireless/intel/iwlwifi/dvm/devices.c +++ b/drivers/net/wireless/intel/iwlwifi/dvm/devices.c @@ -10,6 +10,8 @@ * *****************************************************************************/ +#include <linux/units.h> + /* * DVM device-specific data & functions */ @@ -345,7 +347,7 @@ static s32 iwl_temp_calib_to_offset(struct iwl_priv *priv) static void iwl5150_set_ct_threshold(struct iwl_priv *priv) { const s32 volt2temp_coef = IWL_5150_VOLTAGE_TO_TEMPERATURE_COEFF; - s32 threshold = (s32)CELSIUS_TO_KELVIN(CT_KILL_THRESHOLD_LEGACY) - + s32 threshold = (s32)celsius_to_kelvin(CT_KILL_THRESHOLD_LEGACY) - iwl_temp_calib_to_offset(priv); priv->hw_params.ct_kill_threshold = threshold * volt2temp_coef; @@ -381,7 +383,7 @@ static void iwl5150_temperature(struct iwl_priv *priv) vt = le32_to_cpu(priv->statistics.common.temperature); vt = vt / IWL_5150_VOLTAGE_TO_TEMPERATURE_COEFF + offset; /* now vt hold the temperature in Kelvin */ - priv->temperature = KELVIN_TO_CELSIUS(vt); + priv->temperature = kelvin_to_celsius(vt); iwl_tt_handler(priv); } diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index ad8e4df1282b..4eae441f86c9 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -337,13 +337,7 @@ static void pmem_release_disk(void *__pmem) 
put_disk(pmem->disk); } -static void pmem_pagemap_page_free(struct page *page) -{ - wake_up_var(&page->_refcount); -} - static const struct dev_pagemap_ops fsdax_pagemap_ops = { - .page_free = pmem_pagemap_page_free, .kill = pmem_pagemap_kill, .cleanup = pmem_pagemap_cleanup, }; diff --git a/drivers/nvme/host/hwmon.c b/drivers/nvme/host/hwmon.c index a5af21f5d370..2e6477ed420f 100644 --- a/drivers/nvme/host/hwmon.c +++ b/drivers/nvme/host/hwmon.c @@ -5,14 +5,11 @@ */ #include <linux/hwmon.h> +#include <linux/units.h> #include <asm/unaligned.h> #include "nvme.h" -/* These macros should be moved to linux/temperature.h */ -#define MILLICELSIUS_TO_KELVIN(t) DIV_ROUND_CLOSEST((t) + 273150, 1000) -#define KELVIN_TO_MILLICELSIUS(t) ((t) * 1000L - 273150) - struct nvme_hwmon_data { struct nvme_ctrl *ctrl; struct nvme_smart_log log; @@ -35,7 +32,7 @@ static int nvme_get_temp_thresh(struct nvme_ctrl *ctrl, int sensor, bool under, return -EIO; if (ret < 0) return ret; - *temp = KELVIN_TO_MILLICELSIUS(status & NVME_TEMP_THRESH_MASK); + *temp = kelvin_to_millicelsius(status & NVME_TEMP_THRESH_MASK); return 0; } @@ -46,7 +43,7 @@ static int nvme_set_temp_thresh(struct nvme_ctrl *ctrl, int sensor, bool under, unsigned int threshold = sensor << NVME_TEMP_THRESH_SELECT_SHIFT; int ret; - temp = MILLICELSIUS_TO_KELVIN(temp); + temp = millicelsius_to_kelvin(temp); threshold |= clamp_val(temp, 0, NVME_TEMP_THRESH_MASK); if (under) @@ -88,7 +85,7 @@ static int nvme_hwmon_read(struct device *dev, enum hwmon_sensor_types type, case hwmon_temp_min: return nvme_get_temp_thresh(data->ctrl, channel, true, val); case hwmon_temp_crit: - *val = KELVIN_TO_MILLICELSIUS(data->ctrl->cctemp); + *val = kelvin_to_millicelsius(data->ctrl->cctemp); return 0; default: break; @@ -105,7 +102,7 @@ static int nvme_hwmon_read(struct device *dev, enum hwmon_sensor_types type, temp = get_unaligned_le16(log->temperature); else temp = le16_to_cpu(log->temp_sensor[channel - 1]); - *val = 
KELVIN_TO_MILLICELSIUS(temp); + *val = kelvin_to_millicelsius(temp); break; case hwmon_temp_alarm: *val = !!(log->critical_warning & NVME_SMART_CRIT_TEMPERATURE); diff --git a/drivers/platform/goldfish/goldfish_pipe.c b/drivers/platform/goldfish/goldfish_pipe.c index cef0133aa47a..1ab207ec9c94 100644 --- a/drivers/platform/goldfish/goldfish_pipe.c +++ b/drivers/platform/goldfish/goldfish_pipe.c @@ -257,12 +257,12 @@ static int goldfish_pipe_error_convert(int status) } } -static int pin_user_pages(unsigned long first_page, - unsigned long last_page, - unsigned int last_page_size, - int is_write, - struct page *pages[MAX_BUFFERS_PER_COMMAND], - unsigned int *iter_last_page_size) +static int goldfish_pin_pages(unsigned long first_page, + unsigned long last_page, + unsigned int last_page_size, + int is_write, + struct page *pages[MAX_BUFFERS_PER_COMMAND], + unsigned int *iter_last_page_size) { int ret; int requested_pages = ((last_page - first_page) >> PAGE_SHIFT) + 1; @@ -274,7 +274,7 @@ static int pin_user_pages(unsigned long first_page, *iter_last_page_size = last_page_size; } - ret = get_user_pages_fast(first_page, requested_pages, + ret = pin_user_pages_fast(first_page, requested_pages, !is_write ? 
FOLL_WRITE : 0, pages); if (ret <= 0) @@ -285,18 +285,6 @@ static int pin_user_pages(unsigned long first_page, return ret; } -static void release_user_pages(struct page **pages, int pages_count, - int is_write, s32 consumed_size) -{ - int i; - - for (i = 0; i < pages_count; i++) { - if (!is_write && consumed_size > 0) - set_page_dirty(pages[i]); - put_page(pages[i]); - } -} - /* Populate the call parameters, merging adjacent pages together */ static void populate_rw_params(struct page **pages, int pages_count, @@ -354,9 +342,9 @@ static int transfer_max_buffers(struct goldfish_pipe *pipe, if (mutex_lock_interruptible(&pipe->lock)) return -ERESTARTSYS; - pages_count = pin_user_pages(first_page, last_page, - last_page_size, is_write, - pipe->pages, &iter_last_page_size); + pages_count = goldfish_pin_pages(first_page, last_page, + last_page_size, is_write, + pipe->pages, &iter_last_page_size); if (pages_count < 0) { mutex_unlock(&pipe->lock); return pages_count; @@ -372,7 +360,8 @@ static int transfer_max_buffers(struct goldfish_pipe *pipe, *consumed_size = pipe->command_buffer->rw_params.consumed_size; - release_user_pages(pipe->pages, pages_count, is_write, *consumed_size); + unpin_user_pages_dirty_lock(pipe->pages, pages_count, + !is_write && *consumed_size > 0); mutex_unlock(&pipe->lock); return 0; diff --git a/drivers/platform/x86/asus-wmi.c b/drivers/platform/x86/asus-wmi.c index 43bb15e05529..612ef5526226 100644 --- a/drivers/platform/x86/asus-wmi.c +++ b/drivers/platform/x86/asus-wmi.c @@ -33,9 +33,9 @@ #include <linux/seq_file.h> #include <linux/platform_data/x86/asus-wmi.h> #include <linux/platform_device.h> -#include <linux/thermal.h> #include <linux/acpi.h> #include <linux/dmi.h> +#include <linux/units.h> #include <acpi/battery.h> #include <acpi/video.h> @@ -1514,9 +1514,8 @@ static ssize_t asus_hwmon_temp1(struct device *dev, if (err < 0) return err; - value = DECI_KELVIN_TO_CELSIUS((value & 0xFFFF)) * 1000; - - return sprintf(buf, "%d\n", value); + 
return sprintf(buf, "%ld\n", + deci_kelvin_to_millicelsius(value & 0xFFFF)); } /* Fan1 */ diff --git a/drivers/platform/x86/intel_menlow.c b/drivers/platform/x86/intel_menlow.c index b102f6dd5693..101d7e791a13 100644 --- a/drivers/platform/x86/intel_menlow.c +++ b/drivers/platform/x86/intel_menlow.c @@ -22,6 +22,7 @@ #include <linux/slab.h> #include <linux/thermal.h> #include <linux/types.h> +#include <linux/units.h> MODULE_AUTHOR("Thomas Sujith"); MODULE_AUTHOR("Zhang Rui"); @@ -302,8 +303,10 @@ static ssize_t aux_show(struct device *dev, struct device_attribute *dev_attr, int result; result = sensor_get_auxtrip(attr->handle, idx, &value); + if (result) + return result; - return result ? result : sprintf(buf, "%lu", DECI_KELVIN_TO_CELSIUS(value)); + return sprintf(buf, "%lu", deci_kelvin_to_celsius(value)); } static ssize_t aux0_show(struct device *dev, @@ -332,8 +335,8 @@ static ssize_t aux_store(struct device *dev, struct device_attribute *dev_attr, if (value < 0) return -EINVAL; - result = sensor_set_auxtrip(attr->handle, idx, - CELSIUS_TO_DECI_KELVIN(value)); + result = sensor_set_auxtrip(attr->handle, idx, + celsius_to_deci_kelvin(value)); return result ? 
result : count; } diff --git a/drivers/thermal/armada_thermal.c b/drivers/thermal/armada_thermal.c index 8c4d1244ee7a..7c447cd149e7 100644 --- a/drivers/thermal/armada_thermal.c +++ b/drivers/thermal/armada_thermal.c @@ -21,8 +21,6 @@ #include "thermal_core.h" -#define TO_MCELSIUS(c) ((c) * 1000) - /* Thermal Manager Control and Status Register */ #define PMU_TDC0_SW_RST_MASK (0x1 << 1) #define PMU_TM_DISABLE_OFFS 0 diff --git a/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c b/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c index 75484d6c5056..432213272f1e 100644 --- a/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c +++ b/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c @@ -8,6 +8,7 @@ #include <linux/init.h> #include <linux/acpi.h> #include <linux/thermal.h> +#include <linux/units.h> #include "int340x_thermal_zone.h" static int int340x_thermal_get_zone_temp(struct thermal_zone_device *zone, @@ -34,7 +35,7 @@ static int int340x_thermal_get_zone_temp(struct thermal_zone_device *zone, *temp = (unsigned long)conv_temp * 10; } else /* _TMP returns the temperature in tenths of degrees Kelvin */ - *temp = DECI_KELVIN_TO_MILLICELSIUS(tmp); + *temp = deci_kelvin_to_millicelsius(tmp); return 0; } @@ -116,7 +117,7 @@ static int int340x_thermal_set_trip_temp(struct thermal_zone_device *zone, snprintf(name, sizeof(name), "PAT%d", trip); status = acpi_execute_simple_method(d->adev->handle, name, - MILLICELSIUS_TO_DECI_KELVIN(temp)); + millicelsius_to_deci_kelvin(temp)); if (ACPI_FAILURE(status)) return -EIO; @@ -163,7 +164,7 @@ static int int340x_thermal_get_trip_config(acpi_handle handle, char *name, if (ACPI_FAILURE(status)) return -EIO; - *temp = DECI_KELVIN_TO_MILLICELSIUS(r); + *temp = deci_kelvin_to_millicelsius(r); return 0; } diff --git a/drivers/thermal/intel/intel_pch_thermal.c b/drivers/thermal/intel/intel_pch_thermal.c index ed75a0c603e7..56401fd4708d 100644 --- a/drivers/thermal/intel/intel_pch_thermal.c +++ 
b/drivers/thermal/intel/intel_pch_thermal.c @@ -13,6 +13,7 @@ #include <linux/pci.h> #include <linux/acpi.h> #include <linux/thermal.h> +#include <linux/units.h> #include <linux/pm.h> /* Intel PCH thermal Device IDs */ @@ -93,7 +94,7 @@ static void pch_wpt_add_acpi_psv_trip(struct pch_thermal_device *ptd, if (ACPI_SUCCESS(status)) { unsigned long trip_temp; - trip_temp = DECI_KELVIN_TO_MILLICELSIUS(r); + trip_temp = deci_kelvin_to_millicelsius(r); if (trip_temp) { ptd->psv_temp = trip_temp; ptd->psv_trip_id = *nr_trips; diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 2ada8e6cdb88..a177bf2c6683 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -309,9 +309,8 @@ static int put_pfn(unsigned long pfn, int prot) { if (!is_invalid_reserved_pfn(pfn)) { struct page *page = pfn_to_page(pfn); - if (prot & IOMMU_WRITE) - SetPageDirty(page); - put_page(page); + + unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE); return 1; } return 0; @@ -322,7 +321,6 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr, { struct page *page[1]; struct vm_area_struct *vma; - struct vm_area_struct *vmas[1]; unsigned int flags = 0; int ret; @@ -330,33 +328,14 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr, flags |= FOLL_WRITE; down_read(&mm->mmap_sem); - if (mm == current->mm) { - ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page, - vmas); - } else { - ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page, - vmas, NULL); - /* - * The lifetime of a vaddr_get_pfn() page pin is - * userspace-controlled. In the fs-dax case this could - * lead to indefinite stalls in filesystem operations. - * Disallow attempts to pin fs-dax pages via this - * interface. 
- */ - if (ret > 0 && vma_is_fsdax(vmas[0])) { - ret = -EOPNOTSUPP; - put_page(page[0]); - } - } - up_read(&mm->mmap_sem); - + ret = pin_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM, + page, NULL, NULL); if (ret == 1) { *pfn = page_to_pfn(page[0]); - return 0; + ret = 0; + goto done; } - down_read(&mm->mmap_sem); - vaddr = untagged_addr(vaddr); vma = find_vma_intersection(mm, vaddr, vaddr + 1); @@ -366,7 +345,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr, if (is_invalid_reserved_pfn(*pfn)) ret = 0; } - +done: up_read(&mm->mmap_sem); return ret; } diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c index ecd8d2698515..f4713ea76e82 100644 --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -97,7 +97,7 @@ static struct linux_binfmt elf_format = { .min_coredump = ELF_EXEC_PAGESIZE, }; -#define BAD_ADDR(x) ((unsigned long)(x) >= TASK_SIZE) +#define BAD_ADDR(x) (unlikely((unsigned long)(x) >= TASK_SIZE)) static int set_brk(unsigned long start, unsigned long end, int prot) { @@ -161,9 +161,11 @@ static int padzero(unsigned long elf_bss) #endif static int -create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec, - unsigned long load_addr, unsigned long interp_load_addr) +create_elf_tables(struct linux_binprm *bprm, const struct elfhdr *exec, + unsigned long load_addr, unsigned long interp_load_addr, + unsigned long e_entry) { + struct mm_struct *mm = current->mm; unsigned long p = bprm->p; int argc = bprm->argc; int envc = bprm->envc; @@ -176,7 +178,7 @@ create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec, unsigned char k_rand_bytes[16]; int items; elf_addr_t *elf_info; - int ei_index = 0; + int ei_index; const struct cred *cred = current_cred(); struct vm_area_struct *vma; @@ -226,12 +228,12 @@ create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec, return -EFAULT; /* Create the ELF interpreter info */ - elf_info = (elf_addr_t *)current->mm->saved_auxv; + elf_info = (elf_addr_t *)mm->saved_auxv; /* update 
AT_VECTOR_SIZE_BASE if the number of NEW_AUX_ENT() changes */ #define NEW_AUX_ENT(id, val) \ do { \ - elf_info[ei_index++] = id; \ - elf_info[ei_index++] = val; \ + *elf_info++ = id; \ + *elf_info++ = val; \ } while (0) #ifdef ARCH_DLINFO @@ -251,7 +253,7 @@ create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec, NEW_AUX_ENT(AT_PHNUM, exec->e_phnum); NEW_AUX_ENT(AT_BASE, interp_load_addr); NEW_AUX_ENT(AT_FLAGS, 0); - NEW_AUX_ENT(AT_ENTRY, exec->e_entry); + NEW_AUX_ENT(AT_ENTRY, e_entry); NEW_AUX_ENT(AT_UID, from_kuid_munged(cred->user_ns, cred->uid)); NEW_AUX_ENT(AT_EUID, from_kuid_munged(cred->user_ns, cred->euid)); NEW_AUX_ENT(AT_GID, from_kgid_munged(cred->user_ns, cred->gid)); @@ -275,12 +277,13 @@ create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec, } #undef NEW_AUX_ENT /* AT_NULL is zero; clear the rest too */ - memset(&elf_info[ei_index], 0, - sizeof current->mm->saved_auxv - ei_index * sizeof elf_info[0]); + memset(elf_info, 0, (char *)mm->saved_auxv + + sizeof(mm->saved_auxv) - (char *)elf_info); /* And advance past the AT_NULL entry. */ - ei_index += 2; + elf_info += 2; + ei_index = elf_info - (elf_addr_t *)mm->saved_auxv; sp = STACK_ADD(p, ei_index); items = (argc + 1) + (envc + 1) + 1; @@ -299,7 +302,7 @@ create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec, * Grow the stack manually; some architectures have a limit on how * far ahead a user-space access may be in order to grow the stack. */ - vma = find_extend_vma(current->mm, bprm->p); + vma = find_extend_vma(mm, bprm->p); if (!vma) return -EFAULT; @@ -308,7 +311,7 @@ create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec, return -EFAULT; /* Populate list of argv pointers back to argv strings. 
*/ - p = current->mm->arg_end = current->mm->arg_start; + p = mm->arg_end = mm->arg_start; while (argc-- > 0) { size_t len; if (__put_user((elf_addr_t)p, sp++)) @@ -320,10 +323,10 @@ create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec, } if (__put_user(0, sp++)) return -EFAULT; - current->mm->arg_end = p; + mm->arg_end = p; /* Populate list of envp pointers back to envp strings. */ - current->mm->env_end = current->mm->env_start = p; + mm->env_end = mm->env_start = p; while (envc-- > 0) { size_t len; if (__put_user((elf_addr_t)p, sp++)) @@ -335,10 +338,10 @@ create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec, } if (__put_user(0, sp++)) return -EFAULT; - current->mm->env_end = p; + mm->env_end = p; /* Put the elf_info on the stack in the right place. */ - if (copy_to_user(sp, elf_info, ei_index * sizeof(elf_addr_t))) + if (copy_to_user(sp, mm->saved_auxv, ei_index * sizeof(elf_addr_t))) return -EFAULT; return 0; } @@ -689,15 +692,17 @@ static int load_elf_binary(struct linux_binprm *bprm) int bss_prot = 0; int retval, i; unsigned long elf_entry; + unsigned long e_entry; unsigned long interp_load_addr = 0; unsigned long start_code, end_code, start_data, end_data; unsigned long reloc_func_desc __maybe_unused = 0; int executable_stack = EXSTACK_DEFAULT; + struct elfhdr *elf_ex = (struct elfhdr *)bprm->buf; struct { - struct elfhdr elf_ex; struct elfhdr interp_elf_ex; } *loc; struct arch_elf_state arch_state = INIT_ARCH_ELF_STATE; + struct mm_struct *mm; struct pt_regs *regs; loc = kmalloc(sizeof(*loc), GFP_KERNEL); @@ -705,30 +710,27 @@ static int load_elf_binary(struct linux_binprm *bprm) retval = -ENOMEM; goto out_ret; } - - /* Get the exec-header */ - loc->elf_ex = *((struct elfhdr *)bprm->buf); retval = -ENOEXEC; /* First of all, some simple consistency checks */ - if (memcmp(loc->elf_ex.e_ident, ELFMAG, SELFMAG) != 0) + if (memcmp(elf_ex->e_ident, ELFMAG, SELFMAG) != 0) goto out; - if (loc->elf_ex.e_type != ET_EXEC && loc->elf_ex.e_type 
!= ET_DYN) + if (elf_ex->e_type != ET_EXEC && elf_ex->e_type != ET_DYN) goto out; - if (!elf_check_arch(&loc->elf_ex)) + if (!elf_check_arch(elf_ex)) goto out; - if (elf_check_fdpic(&loc->elf_ex)) + if (elf_check_fdpic(elf_ex)) goto out; if (!bprm->file->f_op->mmap) goto out; - elf_phdata = load_elf_phdrs(&loc->elf_ex, bprm->file); + elf_phdata = load_elf_phdrs(elf_ex, bprm->file); if (!elf_phdata) goto out; elf_ppnt = elf_phdata; - for (i = 0; i < loc->elf_ex.e_phnum; i++, elf_ppnt++) { + for (i = 0; i < elf_ex->e_phnum; i++, elf_ppnt++) { char *elf_interpreter; if (elf_ppnt->p_type != PT_INTERP) @@ -782,7 +784,7 @@ out_free_interp: } elf_ppnt = elf_phdata; - for (i = 0; i < loc->elf_ex.e_phnum; i++, elf_ppnt++) + for (i = 0; i < elf_ex->e_phnum; i++, elf_ppnt++) switch (elf_ppnt->p_type) { case PT_GNU_STACK: if (elf_ppnt->p_flags & PF_X) @@ -792,7 +794,7 @@ out_free_interp: break; case PT_LOPROC ... PT_HIPROC: - retval = arch_elf_pt_proc(&loc->elf_ex, elf_ppnt, + retval = arch_elf_pt_proc(elf_ex, elf_ppnt, bprm->file, false, &arch_state); if (retval) @@ -836,7 +838,7 @@ out_free_interp: * still possible to return an error to the code that invoked * the exec syscall. */ - retval = arch_check_elf(&loc->elf_ex, + retval = arch_check_elf(elf_ex, !!interpreter, &loc->interp_elf_ex, &arch_state); if (retval) @@ -849,8 +851,8 @@ out_free_interp: /* Do this immediately, since STACK_TOP as used in setup_arg_pages may depend on the personality. */ - SET_PERSONALITY2(loc->elf_ex, &arch_state); - if (elf_read_implies_exec(loc->elf_ex, executable_stack)) + SET_PERSONALITY2(*elf_ex, &arch_state); + if (elf_read_implies_exec(*elf_ex, executable_stack)) current->personality |= READ_IMPLIES_EXEC; if (!(current->personality & ADDR_NO_RANDOMIZE) && randomize_va_space) @@ -877,7 +879,7 @@ out_free_interp: /* Now we do a little grungy work by mmapping the ELF image into the correct location in memory. 
*/ for(i = 0, elf_ppnt = elf_phdata; - i < loc->elf_ex.e_phnum; i++, elf_ppnt++) { + i < elf_ex->e_phnum; i++, elf_ppnt++) { int elf_prot, elf_flags; unsigned long k, vaddr; unsigned long total_size = 0; @@ -921,9 +923,9 @@ out_free_interp: * If we are loading ET_EXEC or we have already performed * the ET_DYN load_addr calculations, proceed normally. */ - if (loc->elf_ex.e_type == ET_EXEC || load_addr_set) { + if (elf_ex->e_type == ET_EXEC || load_addr_set) { elf_flags |= MAP_FIXED; - } else if (loc->elf_ex.e_type == ET_DYN) { + } else if (elf_ex->e_type == ET_DYN) { /* * This logic is run once for the first LOAD Program * Header for ET_DYN binaries to calculate the @@ -972,7 +974,7 @@ out_free_interp: load_bias = ELF_PAGESTART(load_bias - vaddr); total_size = total_mapping_size(elf_phdata, - loc->elf_ex.e_phnum); + elf_ex->e_phnum); if (!total_size) { retval = -EINVAL; goto out_free_dentry; @@ -990,7 +992,7 @@ out_free_interp: if (!load_addr_set) { load_addr_set = 1; load_addr = (elf_ppnt->p_vaddr - elf_ppnt->p_offset); - if (loc->elf_ex.e_type == ET_DYN) { + if (elf_ex->e_type == ET_DYN) { load_bias += error - ELF_PAGESTART(load_bias + vaddr); load_addr += load_bias; @@ -998,7 +1000,7 @@ out_free_interp: } } k = elf_ppnt->p_vaddr; - if (k < start_code) + if ((elf_ppnt->p_flags & PF_X) && k < start_code) start_code = k; if (start_data < k) start_data = k; @@ -1031,7 +1033,7 @@ out_free_interp: } } - loc->elf_ex.e_entry += load_bias; + e_entry = elf_ex->e_entry + load_bias; elf_bss += load_bias; elf_brk += load_bias; start_code += load_bias; @@ -1074,7 +1076,7 @@ out_free_interp: allow_write_access(interpreter); fput(interpreter); } else { - elf_entry = loc->elf_ex.e_entry; + elf_entry = e_entry; if (BAD_ADDR(elf_entry)) { retval = -EINVAL; goto out_free_dentry; @@ -1092,15 +1094,17 @@ out_free_interp: goto out; #endif /* ARCH_HAS_SETUP_ADDITIONAL_PAGES */ - retval = create_elf_tables(bprm, &loc->elf_ex, - load_addr, interp_load_addr); + retval = 
create_elf_tables(bprm, elf_ex, + load_addr, interp_load_addr, e_entry); if (retval < 0) goto out; - current->mm->end_code = end_code; - current->mm->start_code = start_code; - current->mm->start_data = start_data; - current->mm->end_data = end_data; - current->mm->start_stack = bprm->p; + + mm = current->mm; + mm->end_code = end_code; + mm->start_code = start_code; + mm->start_data = start_data; + mm->end_data = end_data; + mm->start_stack = bprm->p; if ((current->flags & PF_RANDOMIZE) && (randomize_va_space > 1)) { /* @@ -1111,12 +1115,11 @@ out_free_interp: * growing down), and into the unused ELF_ET_DYN_BASE region. */ if (IS_ENABLED(CONFIG_ARCH_HAS_ELF_RANDOMIZE) && - loc->elf_ex.e_type == ET_DYN && !interpreter) - current->mm->brk = current->mm->start_brk = - ELF_ET_DYN_BASE; + elf_ex->e_type == ET_DYN && !interpreter) { + mm->brk = mm->start_brk = ELF_ET_DYN_BASE; + } - current->mm->brk = current->mm->start_brk = - arch_randomize_brk(current->mm); + mm->brk = mm->start_brk = arch_randomize_brk(mm); #ifdef compat_brk_randomized current->brk_randomized = 1; #endif @@ -1574,6 +1577,7 @@ static void fill_siginfo_note(struct memelfnote *note, user_siginfo_t *csigdata, */ static int fill_files_note(struct memelfnote *note) { + struct mm_struct *mm = current->mm; struct vm_area_struct *vma; unsigned count, size, names_ofs, remaining, n; user_long_t *data; @@ -1581,7 +1585,7 @@ static int fill_files_note(struct memelfnote *note) char *name_base, *name_curpos; /* *Estimated* file count and total data size needed */ - count = current->mm->map_count; + count = mm->map_count; if (count > UINT_MAX / 64) return -EINVAL; size = count * 64; @@ -1591,6 +1595,10 @@ static int fill_files_note(struct memelfnote *note) if (size >= MAX_FILE_NOTE_SIZE) /* paranoia check */ return -EINVAL; size = round_up(size, PAGE_SIZE); + /* + * "size" can be 0 here legitimately. + * Let it ENOMEM and omit NT_FILE section which will be empty anyway. 
+ */ data = kvmalloc(size, GFP_KERNEL); if (ZERO_OR_NULL_PTR(data)) return -ENOMEM; @@ -1599,7 +1607,7 @@ static int fill_files_note(struct memelfnote *note) name_base = name_curpos = ((char *)data) + names_ofs; remaining = size - names_ofs; count = 0; - for (vma = current->mm->mmap; vma != NULL; vma = vma->vm_next) { + for (vma = mm->mmap; vma != NULL; vma = vma->vm_next) { struct file *file; const char *filename; @@ -1633,10 +1641,10 @@ static int fill_files_note(struct memelfnote *note) data[0] = count; data[1] = PAGE_SIZE; /* - * Count usually is less than current->mm->map_count, + * Count usually is less than mm->map_count, * we need to move filenames down. */ - n = current->mm->map_count - count; + n = mm->map_count - count; if (n != 0) { unsigned shift_bytes = n * 3 * sizeof(data[0]); memmove(name_base - shift_bytes, name_base, @@ -2182,7 +2190,7 @@ static int elf_core_dump(struct coredump_params *cprm) int segs, i; size_t vma_data_size = 0; struct vm_area_struct *vma, *gate_vma; - struct elfhdr *elf = NULL; + struct elfhdr elf; loff_t offset = 0, dataoff; struct elf_note_info info = { }; struct elf_phdr *phdr4note = NULL; @@ -2203,10 +2211,6 @@ static int elf_core_dump(struct coredump_params *cprm) * exists while dumping the mm->vm_next areas to the core file. */ - /* alloc memory for large data structures: too large to be on stack */ - elf = kmalloc(sizeof(*elf), GFP_KERNEL); - if (!elf) - goto out; /* * The number of segs are recored into ELF header as 16bit value. * Please check DEFAULT_MAX_MAP_COUNT definition when you modify here. @@ -2230,7 +2234,7 @@ static int elf_core_dump(struct coredump_params *cprm) * Collect all the non-memory information about the process for the * notes. This also sets up the file header. 
*/ - if (!fill_note_info(elf, e_phnum, &info, cprm->siginfo, cprm->regs)) + if (!fill_note_info(&elf, e_phnum, &info, cprm->siginfo, cprm->regs)) goto cleanup; has_dumped = 1; @@ -2238,7 +2242,7 @@ static int elf_core_dump(struct coredump_params *cprm) fs = get_fs(); set_fs(KERNEL_DS); - offset += sizeof(*elf); /* Elf header */ + offset += sizeof(elf); /* Elf header */ offset += segs * sizeof(struct elf_phdr); /* Program headers */ /* Write notes phdr entry */ @@ -2257,11 +2261,13 @@ static int elf_core_dump(struct coredump_params *cprm) dataoff = offset = roundup(offset, ELF_EXEC_PAGESIZE); - if (segs - 1 > ULONG_MAX / sizeof(*vma_filesz)) - goto end_coredump; + /* + * Zero vma process will get ZERO_SIZE_PTR here. + * Let coredump continue for register state at least. + */ vma_filesz = kvmalloc(array_size(sizeof(*vma_filesz), (segs - 1)), GFP_KERNEL); - if (ZERO_OR_NULL_PTR(vma_filesz)) + if (!vma_filesz) goto end_coredump; for (i = 0, vma = first_vma(current, gate_vma); vma != NULL; @@ -2281,12 +2287,12 @@ static int elf_core_dump(struct coredump_params *cprm) shdr4extnum = kmalloc(sizeof(*shdr4extnum), GFP_KERNEL); if (!shdr4extnum) goto end_coredump; - fill_extnum_info(elf, shdr4extnum, e_shoff, segs); + fill_extnum_info(&elf, shdr4extnum, e_shoff, segs); } offset = dataoff; - if (!dump_emit(cprm, elf, sizeof(*elf))) + if (!dump_emit(cprm, &elf, sizeof(elf))) goto end_coredump; if (!dump_emit(cprm, phdr4note, sizeof(*phdr4note))) @@ -2370,8 +2376,6 @@ cleanup: kfree(shdr4extnum); kvfree(vma_filesz); kfree(phdr4note); - kfree(elf); -out: return has_dumped; } diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index de95ad27722f..9ab610cc9114 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -1290,7 +1290,7 @@ int btrfs_decompress_buf2page(const char *buf, unsigned long buf_start, /* copy bytes from the working buffer into the pages */ while (working_bytes > 0) { bytes = min_t(unsigned long, bvec.bv_len, - PAGE_SIZE - buf_offset); + 
PAGE_SIZE - (buf_offset % PAGE_SIZE)); bytes = min(bytes, working_bytes); kaddr = kmap_atomic(bvec.bv_page); diff --git a/fs/btrfs/zlib.c b/fs/btrfs/zlib.c index a6c90a003c12..05615a1099db 100644 --- a/fs/btrfs/zlib.c +++ b/fs/btrfs/zlib.c @@ -20,9 +20,13 @@ #include <linux/refcount.h> #include "compression.h" +/* workspace buffer size for s390 zlib hardware support */ +#define ZLIB_DFLTCC_BUF_SIZE (4 * PAGE_SIZE) + struct workspace { z_stream strm; char *buf; + unsigned int buf_size; struct list_head list; int level; }; @@ -61,7 +65,21 @@ struct list_head *zlib_alloc_workspace(unsigned int level) zlib_inflate_workspacesize()); workspace->strm.workspace = kvmalloc(workspacesize, GFP_KERNEL); workspace->level = level; - workspace->buf = kmalloc(PAGE_SIZE, GFP_KERNEL); + workspace->buf = NULL; + /* + * In case of s390 zlib hardware support, allocate lager workspace + * buffer. If allocator fails, fall back to a single page buffer. + */ + if (zlib_deflate_dfltcc_enabled()) { + workspace->buf = kmalloc(ZLIB_DFLTCC_BUF_SIZE, + __GFP_NOMEMALLOC | __GFP_NORETRY | + __GFP_NOWARN | GFP_NOIO); + workspace->buf_size = ZLIB_DFLTCC_BUF_SIZE; + } + if (!workspace->buf) { + workspace->buf = kmalloc(PAGE_SIZE, GFP_KERNEL); + workspace->buf_size = PAGE_SIZE; + } if (!workspace->strm.workspace || !workspace->buf) goto fail; @@ -85,6 +103,7 @@ int zlib_compress_pages(struct list_head *ws, struct address_space *mapping, struct page *in_page = NULL; struct page *out_page = NULL; unsigned long bytes_left; + unsigned int in_buf_pages; unsigned long len = *total_out; unsigned long nr_dest_pages = *out_pages; const unsigned long max_out = nr_dest_pages * PAGE_SIZE; @@ -102,9 +121,6 @@ int zlib_compress_pages(struct list_head *ws, struct address_space *mapping, workspace->strm.total_in = 0; workspace->strm.total_out = 0; - in_page = find_get_page(mapping, start >> PAGE_SHIFT); - data_in = kmap(in_page); - out_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM); if (out_page == NULL) { ret = 
-ENOMEM; @@ -114,12 +130,51 @@ int zlib_compress_pages(struct list_head *ws, struct address_space *mapping, pages[0] = out_page; nr_pages = 1; - workspace->strm.next_in = data_in; + workspace->strm.next_in = workspace->buf; + workspace->strm.avail_in = 0; workspace->strm.next_out = cpage_out; workspace->strm.avail_out = PAGE_SIZE; - workspace->strm.avail_in = min(len, PAGE_SIZE); while (workspace->strm.total_in < len) { + /* + * Get next input pages and copy the contents to + * the workspace buffer if required. + */ + if (workspace->strm.avail_in == 0) { + bytes_left = len - workspace->strm.total_in; + in_buf_pages = min(DIV_ROUND_UP(bytes_left, PAGE_SIZE), + workspace->buf_size / PAGE_SIZE); + if (in_buf_pages > 1) { + int i; + + for (i = 0; i < in_buf_pages; i++) { + if (in_page) { + kunmap(in_page); + put_page(in_page); + } + in_page = find_get_page(mapping, + start >> PAGE_SHIFT); + data_in = kmap(in_page); + memcpy(workspace->buf + i * PAGE_SIZE, + data_in, PAGE_SIZE); + start += PAGE_SIZE; + } + workspace->strm.next_in = workspace->buf; + } else { + if (in_page) { + kunmap(in_page); + put_page(in_page); + } + in_page = find_get_page(mapping, + start >> PAGE_SHIFT); + data_in = kmap(in_page); + start += PAGE_SIZE; + workspace->strm.next_in = data_in; + } + workspace->strm.avail_in = min(bytes_left, + (unsigned long) workspace->buf_size); + } + ret = zlib_deflate(&workspace->strm, Z_SYNC_FLUSH); if (ret != Z_OK) { pr_debug("BTRFS: deflate in loop returned %d\n", @@ -161,33 +216,43 @@ int zlib_compress_pages(struct list_head *ws, struct address_space *mapping, /* we're all done */ if (workspace->strm.total_in >= len) break; - - /* we've read in a full page, get a new one */ - if (workspace->strm.avail_in == 0) { - if (workspace->strm.total_out > max_out) - break; - - bytes_left = len - workspace->strm.total_in; - kunmap(in_page); - put_page(in_page); - - start += PAGE_SIZE; - in_page = find_get_page(mapping, - start >> PAGE_SHIFT); - data_in = kmap(in_page); - 
workspace->strm.avail_in = min(bytes_left, - PAGE_SIZE); - workspace->strm.next_in = data_in; - } + if (workspace->strm.total_out > max_out) + break; } workspace->strm.avail_in = 0; - ret = zlib_deflate(&workspace->strm, Z_FINISH); - zlib_deflateEnd(&workspace->strm); - - if (ret != Z_STREAM_END) { - ret = -EIO; - goto out; + /* + * Call deflate with Z_FINISH flush parameter providing more output + * space but no more input data, until it returns with Z_STREAM_END. + */ + while (ret != Z_STREAM_END) { + ret = zlib_deflate(&workspace->strm, Z_FINISH); + if (ret == Z_STREAM_END) + break; + if (ret != Z_OK && ret != Z_BUF_ERROR) { + zlib_deflateEnd(&workspace->strm); + ret = -EIO; + goto out; + } else if (workspace->strm.avail_out == 0) { + /* get another page for the stream end */ + kunmap(out_page); + if (nr_pages == nr_dest_pages) { + out_page = NULL; + ret = -E2BIG; + goto out; + } + out_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM); + if (out_page == NULL) { + ret = -ENOMEM; + goto out; + } + cpage_out = kmap(out_page); + pages[nr_pages] = out_page; + nr_pages++; + workspace->strm.avail_out = PAGE_SIZE; + workspace->strm.next_out = cpage_out; + } } + zlib_deflateEnd(&workspace->strm); if (workspace->strm.total_out >= workspace->strm.total_in) { ret = -E2BIG; @@ -231,7 +296,7 @@ int zlib_decompress_bio(struct list_head *ws, struct compressed_bio *cb) workspace->strm.total_out = 0; workspace->strm.next_out = workspace->buf; - workspace->strm.avail_out = PAGE_SIZE; + workspace->strm.avail_out = workspace->buf_size; /* If it's deflate, and it's got no preset dictionary, then we can tell zlib to skip the adler32 check. 
*/ @@ -270,7 +335,7 @@ int zlib_decompress_bio(struct list_head *ws, struct compressed_bio *cb) } workspace->strm.next_out = workspace->buf; - workspace->strm.avail_out = PAGE_SIZE; + workspace->strm.avail_out = workspace->buf_size; if (workspace->strm.avail_in == 0) { unsigned long tmp; @@ -320,7 +385,7 @@ int zlib_decompress(struct list_head *ws, unsigned char *data_in, workspace->strm.total_in = 0; workspace->strm.next_out = workspace->buf; - workspace->strm.avail_out = PAGE_SIZE; + workspace->strm.avail_out = workspace->buf_size; workspace->strm.total_out = 0; /* If it's deflate, and it's got no preset dictionary, then we can tell zlib to skip the adler32 check. */ @@ -364,7 +429,7 @@ int zlib_decompress(struct list_head *ws, unsigned char *data_in, buf_offset = 0; bytes = min(PAGE_SIZE - pg_offset, - PAGE_SIZE - buf_offset); + PAGE_SIZE - (buf_offset % PAGE_SIZE)); bytes = min(bytes, bytes_left); kaddr = kmap_atomic(dest_page); @@ -375,7 +440,7 @@ int zlib_decompress(struct list_head *ws, unsigned char *data_in, bytes_left -= bytes; next: workspace->strm.next_out = workspace->buf; - workspace->strm.avail_out = PAGE_SIZE; + workspace->strm.avail_out = workspace->buf_size; } if (ret != Z_STREAM_END && bytes_left != 0) diff --git a/fs/exec.c b/fs/exec.c index 3dd09e1e3e5e..db17be51b112 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -760,6 +760,11 @@ int setup_arg_pages(struct linux_binprm *bprm, goto out_unlock; BUG_ON(prev != vma); + if (unlikely(vm_flags & VM_EXEC)) { + pr_warn_once("process '%pD4' started with executable stack\n", + bprm->file); + } + /* Move stack pages down in memory. 
*/ if (stack_shift) { ret = shift_arg_pages(vma, stack_shift); diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 335607b8c5c0..76ac9c7d32ec 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -2063,7 +2063,7 @@ void wb_workfn(struct work_struct *work) struct bdi_writeback, dwork); long pages_written; - set_worker_desc("flush-%s", dev_name(wb->bdi->dev)); + set_worker_desc("flush-%s", bdi_dev_name(wb->bdi)); current->flags |= PF_SWAPWRITE; if (likely(!current_is_workqueue_rescuer() || diff --git a/fs/io_uring.c b/fs/io_uring.c index ac5340fdcdfe..1806afddfea5 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -6005,7 +6005,7 @@ static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx) struct io_mapped_ubuf *imu = &ctx->user_bufs[i]; for (j = 0; j < imu->nr_bvecs; j++) - put_user_page(imu->bvec[j].bv_page); + unpin_user_page(imu->bvec[j].bv_page); if (ctx->account_mem) io_unaccount_mem(ctx->user, imu->nr_bvecs); @@ -6126,7 +6126,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg, ret = 0; down_read(&current->mm->mmap_sem); - pret = get_user_pages(ubuf, nr_pages, + pret = pin_user_pages(ubuf, nr_pages, FOLL_WRITE | FOLL_LONGTERM, pages, vmas); if (pret == nr_pages) { @@ -6150,7 +6150,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg, * release any pages we did get */ if (pret > 0) - put_user_pages(pages, pret); + unpin_user_pages(pages, pret); if (ctx->account_mem) io_unaccount_mem(ctx->user, nr_pages); kvfree(imu->bvec); diff --git a/fs/ocfs2/cluster/quorum.c b/fs/ocfs2/cluster/quorum.c index 5c424a099280..1ef24574f481 100644 --- a/fs/ocfs2/cluster/quorum.c +++ b/fs/ocfs2/cluster/quorum.c @@ -73,7 +73,7 @@ static void o2quo_fence_self(void) "system by restarting ***\n"); emergency_restart(); break; - }; + } } /* Indicate that a timeout occurred on a heartbeat region write. 
The diff --git a/fs/ocfs2/dlm/Makefile b/fs/ocfs2/dlm/Makefile index 38b224372776..5e700b45d32d 100644 --- a/fs/ocfs2/dlm/Makefile +++ b/fs/ocfs2/dlm/Makefile @@ -1,6 +1,4 @@ # SPDX-License-Identifier: GPL-2.0-only -ccflags-y := -I $(srctree)/$(src)/.. - obj-$(CONFIG_OCFS2_FS_O2CB) += ocfs2_dlm.o ocfs2_dlm-objs := dlmdomain.o dlmdebug.o dlmthread.o dlmrecovery.o \ diff --git a/fs/ocfs2/dlm/dlmast.c b/fs/ocfs2/dlm/dlmast.c index 4de89af96abf..6abaded3ff6b 100644 --- a/fs/ocfs2/dlm/dlmast.c +++ b/fs/ocfs2/dlm/dlmast.c @@ -23,15 +23,15 @@ #include <linux/spinlock.h> -#include "cluster/heartbeat.h" -#include "cluster/nodemanager.h" -#include "cluster/tcp.h" +#include "../cluster/heartbeat.h" +#include "../cluster/nodemanager.h" +#include "../cluster/tcp.h" #include "dlmapi.h" #include "dlmcommon.h" #define MLOG_MASK_PREFIX ML_DLM -#include "cluster/masklog.h" +#include "../cluster/masklog.h" static void dlm_update_lvb(struct dlm_ctxt *dlm, struct dlm_lock_resource *res, struct dlm_lock *lock); diff --git a/fs/ocfs2/dlm/dlmcommon.h b/fs/ocfs2/dlm/dlmcommon.h index aaf24548b02a..0463dce65bb2 100644 --- a/fs/ocfs2/dlm/dlmcommon.h +++ b/fs/ocfs2/dlm/dlmcommon.h @@ -688,10 +688,6 @@ struct dlm_begin_reco __be32 pad2; }; - -#define BITS_PER_BYTE 8 -#define BITS_TO_BYTES(bits) (((bits)+BITS_PER_BYTE-1)/BITS_PER_BYTE) - struct dlm_query_join_request { u8 node_idx; diff --git a/fs/ocfs2/dlm/dlmconvert.c b/fs/ocfs2/dlm/dlmconvert.c index 965f45dbe17b..6051edc33aef 100644 --- a/fs/ocfs2/dlm/dlmconvert.c +++ b/fs/ocfs2/dlm/dlmconvert.c @@ -23,9 +23,9 @@ #include <linux/spinlock.h> -#include "cluster/heartbeat.h" -#include "cluster/nodemanager.h" -#include "cluster/tcp.h" +#include "../cluster/heartbeat.h" +#include "../cluster/nodemanager.h" +#include "../cluster/tcp.h" #include "dlmapi.h" #include "dlmcommon.h" @@ -33,7 +33,7 @@ #include "dlmconvert.h" #define MLOG_MASK_PREFIX ML_DLM -#include "cluster/masklog.h" +#include "../cluster/masklog.h" /* NOTE: __dlmconvert_master is 
the only function in here that * needs a spinlock held on entry (res->spinlock) and it is the diff --git a/fs/ocfs2/dlm/dlmdebug.c b/fs/ocfs2/dlm/dlmdebug.c index 4d0b452012b2..c5c6efba7b5e 100644 --- a/fs/ocfs2/dlm/dlmdebug.c +++ b/fs/ocfs2/dlm/dlmdebug.c @@ -17,9 +17,9 @@ #include <linux/debugfs.h> #include <linux/export.h> -#include "cluster/heartbeat.h" -#include "cluster/nodemanager.h" -#include "cluster/tcp.h" +#include "../cluster/heartbeat.h" +#include "../cluster/nodemanager.h" +#include "../cluster/tcp.h" #include "dlmapi.h" #include "dlmcommon.h" @@ -27,7 +27,7 @@ #include "dlmdebug.h" #define MLOG_MASK_PREFIX ML_DLM -#include "cluster/masklog.h" +#include "../cluster/masklog.h" static int stringify_lockname(const char *lockname, int locklen, char *buf, int len); diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c index ee6f459f9770..357cfc702ce3 100644 --- a/fs/ocfs2/dlm/dlmdomain.c +++ b/fs/ocfs2/dlm/dlmdomain.c @@ -20,9 +20,9 @@ #include <linux/debugfs.h> #include <linux/sched/signal.h> -#include "cluster/heartbeat.h" -#include "cluster/nodemanager.h" -#include "cluster/tcp.h" +#include "../cluster/heartbeat.h" +#include "../cluster/nodemanager.h" +#include "../cluster/tcp.h" #include "dlmapi.h" #include "dlmcommon.h" @@ -30,7 +30,7 @@ #include "dlmdebug.h" #define MLOG_MASK_PREFIX (ML_DLM|ML_DLM_DOMAIN) -#include "cluster/masklog.h" +#include "../cluster/masklog.h" /* * ocfs2 node maps are array of long int, which limits to send them freely diff --git a/fs/ocfs2/dlm/dlmlock.c b/fs/ocfs2/dlm/dlmlock.c index baff087f3863..83f0760e4fba 100644 --- a/fs/ocfs2/dlm/dlmlock.c +++ b/fs/ocfs2/dlm/dlmlock.c @@ -25,9 +25,9 @@ #include <linux/delay.h> -#include "cluster/heartbeat.h" -#include "cluster/nodemanager.h" -#include "cluster/tcp.h" +#include "../cluster/heartbeat.h" +#include "../cluster/nodemanager.h" +#include "../cluster/tcp.h" #include "dlmapi.h" #include "dlmcommon.h" @@ -35,7 +35,7 @@ #include "dlmconvert.h" #define MLOG_MASK_PREFIX 
ML_DLM -#include "cluster/masklog.h" +#include "../cluster/masklog.h" static struct kmem_cache *dlm_lock_cache; diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c index 74b768ca1cd8..900f7e466d11 100644 --- a/fs/ocfs2/dlm/dlmmaster.c +++ b/fs/ocfs2/dlm/dlmmaster.c @@ -25,9 +25,9 @@ #include <linux/delay.h> -#include "cluster/heartbeat.h" -#include "cluster/nodemanager.h" -#include "cluster/tcp.h" +#include "../cluster/heartbeat.h" +#include "../cluster/nodemanager.h" +#include "../cluster/tcp.h" #include "dlmapi.h" #include "dlmcommon.h" @@ -35,7 +35,7 @@ #include "dlmdebug.h" #define MLOG_MASK_PREFIX (ML_DLM|ML_DLM_MASTER) -#include "cluster/masklog.h" +#include "../cluster/masklog.h" static void dlm_mle_node_down(struct dlm_ctxt *dlm, struct dlm_master_list_entry *mle, @@ -2554,8 +2554,6 @@ static int dlm_migrate_lockres(struct dlm_ctxt *dlm, if (!dlm_grab(dlm)) return -EINVAL; - BUG_ON(target == O2NM_MAX_NODES); - name = res->lockname.name; namelen = res->lockname.len; diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c index 064ce5bbc3f6..4b566e88582f 100644 --- a/fs/ocfs2/dlm/dlmrecovery.c +++ b/fs/ocfs2/dlm/dlmrecovery.c @@ -26,16 +26,16 @@ #include <linux/delay.h> -#include "cluster/heartbeat.h" -#include "cluster/nodemanager.h" -#include "cluster/tcp.h" +#include "../cluster/heartbeat.h" +#include "../cluster/nodemanager.h" +#include "../cluster/tcp.h" #include "dlmapi.h" #include "dlmcommon.h" #include "dlmdomain.h" #define MLOG_MASK_PREFIX (ML_DLM|ML_DLM_RECOVERY) -#include "cluster/masklog.h" +#include "../cluster/masklog.h" static void dlm_do_local_recovery_cleanup(struct dlm_ctxt *dlm, u8 dead_node); @@ -1668,7 +1668,7 @@ static int dlm_lockres_master_requery(struct dlm_ctxt *dlm, int dlm_do_master_requery(struct dlm_ctxt *dlm, struct dlm_lock_resource *res, u8 nodenum, u8 *real_master) { - int ret = -EINVAL; + int ret; struct dlm_master_requery req; int status = DLM_LOCK_RES_OWNER_UNKNOWN; diff --git 
a/fs/ocfs2/dlm/dlmthread.c b/fs/ocfs2/dlm/dlmthread.c index 61c51c268460..fd40c17cd022 100644 --- a/fs/ocfs2/dlm/dlmthread.c +++ b/fs/ocfs2/dlm/dlmthread.c @@ -25,16 +25,16 @@ #include <linux/delay.h> -#include "cluster/heartbeat.h" -#include "cluster/nodemanager.h" -#include "cluster/tcp.h" +#include "../cluster/heartbeat.h" +#include "../cluster/nodemanager.h" +#include "../cluster/tcp.h" #include "dlmapi.h" #include "dlmcommon.h" #include "dlmdomain.h" #define MLOG_MASK_PREFIX (ML_DLM|ML_DLM_THREAD) -#include "cluster/masklog.h" +#include "../cluster/masklog.h" static int dlm_thread(void *data); static void dlm_flush_asts(struct dlm_ctxt *dlm); diff --git a/fs/ocfs2/dlm/dlmunlock.c b/fs/ocfs2/dlm/dlmunlock.c index 3883633e82eb..dcb17ca8ae74 100644 --- a/fs/ocfs2/dlm/dlmunlock.c +++ b/fs/ocfs2/dlm/dlmunlock.c @@ -23,15 +23,15 @@ #include <linux/spinlock.h> #include <linux/delay.h> -#include "cluster/heartbeat.h" -#include "cluster/nodemanager.h" -#include "cluster/tcp.h" +#include "../cluster/heartbeat.h" +#include "../cluster/nodemanager.h" +#include "../cluster/tcp.h" #include "dlmapi.h" #include "dlmcommon.h" #define MLOG_MASK_PREFIX ML_DLM -#include "cluster/masklog.h" +#include "../cluster/masklog.h" #define DLM_UNLOCK_FREE_LOCK 0x00000001 #define DLM_UNLOCK_CALL_AST 0x00000002 diff --git a/fs/ocfs2/dlmfs/Makefile b/fs/ocfs2/dlmfs/Makefile index a9874e441bd4..c7895f65be0e 100644 --- a/fs/ocfs2/dlmfs/Makefile +++ b/fs/ocfs2/dlmfs/Makefile @@ -1,6 +1,4 @@ # SPDX-License-Identifier: GPL-2.0-only -ccflags-y := -I $(srctree)/$(src)/.. 
- obj-$(CONFIG_OCFS2_FS) += ocfs2_dlmfs.o ocfs2_dlmfs-objs := userdlm.o dlmfs.o diff --git a/fs/ocfs2/dlmfs/dlmfs.c b/fs/ocfs2/dlmfs/dlmfs.c index 4f1668c81e1f..8e4f1ace467c 100644 --- a/fs/ocfs2/dlmfs/dlmfs.c +++ b/fs/ocfs2/dlmfs/dlmfs.c @@ -33,11 +33,11 @@ #include <linux/uaccess.h> -#include "stackglue.h" +#include "../stackglue.h" #include "userdlm.h" #define MLOG_MASK_PREFIX ML_DLMFS -#include "cluster/masklog.h" +#include "../cluster/masklog.h" static const struct super_operations dlmfs_ops; diff --git a/fs/ocfs2/dlmfs/userdlm.c b/fs/ocfs2/dlmfs/userdlm.c index 525b14ddfba5..3df5be25bfb1 100644 --- a/fs/ocfs2/dlmfs/userdlm.c +++ b/fs/ocfs2/dlmfs/userdlm.c @@ -21,12 +21,12 @@ #include <linux/types.h> #include <linux/crc32.h> -#include "ocfs2_lockingver.h" -#include "stackglue.h" +#include "../ocfs2_lockingver.h" +#include "../stackglue.h" #include "userdlm.h" #define MLOG_MASK_PREFIX ML_DLMFS -#include "cluster/masklog.h" +#include "../cluster/masklog.h" static inline struct user_lock_res *user_lksb_to_lock_res(struct ocfs2_dlm_lksb *lksb) diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c index cda1027d0819..cb9e6a73bea9 100644 --- a/fs/ocfs2/dlmglue.c +++ b/fs/ocfs2/dlmglue.c @@ -570,7 +570,7 @@ void ocfs2_inode_lock_res_init(struct ocfs2_lock_res *res, mlog_bug_on_msg(1, "type: %d\n", type); ops = NULL; /* thanks, gcc */ break; - }; + } ocfs2_build_lock_name(type, OCFS2_I(inode)->ip_blkno, generation, res->l_name); diff --git a/fs/ocfs2/journal.h b/fs/ocfs2/journal.h index 3103ba7f97a2..bfe611ed1b1d 100644 --- a/fs/ocfs2/journal.h +++ b/fs/ocfs2/journal.h @@ -597,9 +597,11 @@ static inline void ocfs2_update_inode_fsync_trans(handle_t *handle, { struct ocfs2_inode_info *oi = OCFS2_I(inode); - oi->i_sync_tid = handle->h_transaction->t_tid; - if (datasync) - oi->i_datasync_tid = handle->h_transaction->t_tid; + if (!is_handle_aborted(handle)) { + oi->i_sync_tid = handle->h_transaction->t_tid; + if (datasync) + oi->i_datasync_tid = 
handle->h_transaction->t_tid; + } } #endif /* OCFS2_JOURNAL_H */ diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c index 8ea51cf27b97..da65251ef815 100644 --- a/fs/ocfs2/namei.c +++ b/fs/ocfs2/namei.c @@ -586,8 +586,7 @@ static int __ocfs2_mknod_locked(struct inode *dir, mlog_errno(status); } - oi->i_sync_tid = handle->h_transaction->t_tid; - oi->i_datasync_tid = handle->h_transaction->t_tid; + ocfs2_update_inode_fsync_trans(handle, inode, 1); leave: if (status < 0) { diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c index 6051e7bbc221..8bf88d690729 100644 --- a/fs/reiserfs/stree.c +++ b/fs/reiserfs/stree.c @@ -2240,7 +2240,8 @@ error_out: /* also releases the path */ unfix_nodes(&s_ins_balance); #ifdef REISERQUOTA_DEBUG - reiserfs_debug(th->t_super, REISERFS_DEBUG_CODE, + if (inode) + reiserfs_debug(th->t_super, REISERFS_DEBUG_CODE, "reiserquota insert_item(): freeing %u id=%u type=%c", quota_bytes, inode->i_uid, head2type(ih)); #endif diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index 97967ce06de3..f88197c1ffc2 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -13,6 +13,7 @@ #include <linux/fs.h> #include <linux/sched.h> #include <linux/blkdev.h> +#include <linux/device.h> #include <linux/writeback.h> #include <linux/blk-cgroup.h> #include <linux/backing-dev-defs.h> @@ -504,4 +505,13 @@ static inline int bdi_rw_congested(struct backing_dev_info *bdi) (1 << WB_async_congested)); } +extern const char *bdi_unknown_name; + +static inline const char *bdi_dev_name(struct backing_dev_info *bdi) +{ + if (!bdi || !bdi->dev) + return bdi_unknown_name; + return dev_name(bdi->dev); +} + #endif /* _LINUX_BACKING_DEV_H */ diff --git a/include/linux/bitops.h b/include/linux/bitops.h index e479067c202c..6c7c4133c25c 100644 --- a/include/linux/bitops.h +++ b/include/linux/bitops.h @@ -13,6 +13,7 @@ #define BITS_PER_TYPE(type) (sizeof(type) * BITS_PER_BYTE) #define BITS_TO_LONGS(nr) DIV_ROUND_UP(nr, BITS_PER_TYPE(long)) 
+#define BITS_TO_BYTES(nr) DIV_ROUND_UP(nr, BITS_PER_TYPE(char)) extern unsigned int __sw_hweight8(unsigned int w); extern unsigned int __sw_hweight16(unsigned int w); diff --git a/include/linux/fs.h b/include/linux/fs.h index 40be2ccb87f3..41584f50af0d 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2737,7 +2737,6 @@ static inline int filemap_fdatawait(struct address_space *mapping) extern bool filemap_range_has_page(struct address_space *, loff_t lstart, loff_t lend); -extern int filemap_write_and_wait(struct address_space *mapping); extern int filemap_write_and_wait_range(struct address_space *mapping, loff_t lstart, loff_t lend); extern int __filemap_fdatawrite_range(struct address_space *mapping, @@ -2747,6 +2746,11 @@ extern int filemap_fdatawrite_range(struct address_space *mapping, extern int filemap_check_errors(struct address_space *mapping); extern void __filemap_set_wb_err(struct address_space *mapping, int err); +static inline int filemap_write_and_wait(struct address_space *mapping) +{ + return filemap_write_and_wait_range(mapping, 0, LLONG_MAX); +} + extern int __must_check file_fdatawait_range(struct file *file, loff_t lstart, loff_t lend); extern int __must_check file_check_and_advance_wb_err(struct file *file); diff --git a/include/linux/io-mapping.h b/include/linux/io-mapping.h index 6e125e9b4187..837058bc1c9f 100644 --- a/include/linux/io-mapping.h +++ b/include/linux/io-mapping.h @@ -28,6 +28,7 @@ struct io_mapping { #ifdef CONFIG_HAVE_ATOMIC_IOMAP +#include <linux/pfn.h> #include <asm/iomap.h> /* * For small address space machines, mapping large objects @@ -64,12 +65,10 @@ io_mapping_map_atomic_wc(struct io_mapping *mapping, unsigned long offset) { resource_size_t phys_addr; - unsigned long pfn; BUG_ON(offset >= mapping->size); phys_addr = mapping->base + offset; - pfn = (unsigned long) (phys_addr >> PAGE_SHIFT); - return iomap_atomic_prot_pfn(pfn, mapping->prot); + return iomap_atomic_prot_pfn(PHYS_PFN(phys_addr), mapping->prot); 
 }
 static inline void
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index b38bbefabfab..079d17d96410 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -113,6 +113,9 @@ int memblock_add(phys_addr_t base, phys_addr_t size);
 int memblock_remove(phys_addr_t base, phys_addr_t size);
 int memblock_free(phys_addr_t base, phys_addr_t size);
 int memblock_reserve(phys_addr_t base, phys_addr_t size);
+#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
+int memblock_physmem_add(phys_addr_t base, phys_addr_t size);
+#endif
 void memblock_trim_memory(phys_addr_t align);
 bool memblock_overlaps_region(struct memblock_type *type,
 			      phys_addr_t base, phys_addr_t size);
@@ -127,10 +130,6 @@ void reset_node_managed_pages(pg_data_t *pgdat);
 void reset_all_zones_managed_pages(void);
 
 /* Low level functions */
-int memblock_add_range(struct memblock_type *type,
-		       phys_addr_t base, phys_addr_t size,
-		       int nid, enum memblock_flags flags);
-
 void __next_mem_range(u64 *idx, int nid, enum memblock_flags flags,
 		      struct memblock_type *type_a,
 		      struct memblock_type *type_b, phys_addr_t *out_start,
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 4c75dae8dd29..0b8d791b6669 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -29,8 +29,6 @@ struct memory_block {
 	int section_count; /* serialized by mem_sysfs_mutex */
 	int online_type; /* for passing data to online routine */
 	int phys_device; /* to which fru does this belong? */
-	void *hw; /* optional pointer to fw/hw data */
-	int (*phys_callback)(struct memory_block *);
 	struct device dev;
 	int nid; /* NID for this memory block */
 };
@@ -55,19 +53,6 @@ struct memory_notify {
 	int status_change_nid;
 };
 
-/*
- * During pageblock isolation, count the number of pages within the
- * range [start_pfn, start_pfn + nr_pages) which are owned by code
- * in the notifier chain.
- */
-#define MEM_ISOLATE_COUNT (1<<0)
-
-struct memory_isolate_notify {
-	unsigned long start_pfn; /* Start of range to check */
-	unsigned int nr_pages; /* # pages in range to check */
-	unsigned int pages_found; /* # pages owned found by callbacks */
-};
-
 struct notifier_block;
 struct mem_section;
 
@@ -94,27 +79,13 @@ static inline int memory_notify(unsigned long val, void *v)
 {
 	return 0;
 }
-static inline int register_memory_isolate_notifier(struct notifier_block *nb)
-{
-	return 0;
-}
-static inline void unregister_memory_isolate_notifier(struct notifier_block *nb)
-{
-}
-static inline int memory_isolate_notify(unsigned long val, void *v)
-{
-	return 0;
-}
 #else
 extern int register_memory_notifier(struct notifier_block *nb);
 extern void unregister_memory_notifier(struct notifier_block *nb);
-extern int register_memory_isolate_notifier(struct notifier_block *nb);
-extern void unregister_memory_isolate_notifier(struct notifier_block *nb);
 int create_memory_block_devices(unsigned long start, unsigned long size);
 void remove_memory_block_devices(unsigned long start, unsigned long size);
 extern void memory_dev_init(void);
 extern int memory_notify(unsigned long val, void *v);
-extern int memory_isolate_notify(unsigned long val, void *v);
 extern struct memory_block *find_memory_block(struct mem_section *);
 typedef int (*walk_memory_blocks_func_t)(struct memory_block *, void *);
 extern int walk_memory_blocks(unsigned long start, unsigned long size,
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index ba0dca6aac6e..ffa6ad12d84a 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -94,7 +94,8 @@ extern int zone_grow_free_lists(struct zone *zone, unsigned long new_nr_pages);
 extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages);
 extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
 /* VM interface that may be used by firmware interface */
-extern int online_pages(unsigned long, unsigned long, int);
+extern int online_pages(unsigned long pfn, unsigned long nr_pages,
+			int online_type, int nid);
 extern int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn,
 	unsigned long *valid_start, unsigned long *valid_end);
 extern unsigned long __offline_isolated_pages(unsigned long start_pfn,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1233bf45164d..73a044ed6981 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -70,11 +70,6 @@ static inline void totalram_pages_add(long count)
 	atomic_long_add(count, &_totalram_pages);
 }
 
-static inline void totalram_pages_set(long val)
-{
-	atomic_long_set(&_totalram_pages, val);
-}
-
 extern void * high_memory;
 extern int page_cluster;
 
@@ -916,10 +911,6 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
 
 #define ZONEID_PGSHIFT (ZONEID_PGOFF * (ZONEID_SHIFT != 0))
 
-#if SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS
-#error SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS
-#endif
-
 #define ZONES_MASK ((1UL << ZONES_WIDTH) - 1)
 #define NODES_MASK ((1UL << NODES_WIDTH) - 1)
 #define SECTIONS_MASK ((1UL << SECTIONS_WIDTH) - 1)
@@ -947,9 +938,10 @@ static inline bool is_zone_device_page(const struct page *page)
 #endif
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void __put_devmap_managed_page(struct page *page);
+void free_devmap_managed_page(struct page *page);
 DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-static inline bool put_devmap_managed_page(struct page *page)
+
+static inline bool page_is_devmap_managed(struct page *page)
 {
 	if (!static_branch_unlikely(&devmap_managed_key))
 		return false;
@@ -958,7 +950,6 @@ static inline bool put_devmap_managed_page(struct page *page)
 	switch (page->pgmap->type) {
 	case MEMORY_DEVICE_PRIVATE:
 	case MEMORY_DEVICE_FS_DAX:
-		__put_devmap_managed_page(page);
 		return true;
 	default:
 		break;
@@ -966,11 +957,17 @@ static inline bool put_devmap_managed_page(struct page *page)
 	return false;
 }
 
+void put_devmap_managed_page(struct page *page);
+
 #else /* CONFIG_DEV_PAGEMAP_OPS */
-static inline bool put_devmap_managed_page(struct page *page)
+static inline bool page_is_devmap_managed(struct page *page)
 {
 	return false;
 }
+
+static inline void put_devmap_managed_page(struct page *page)
+{
+}
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
 
 static inline bool is_device_private_page(const struct page *page)
@@ -1023,37 +1020,37 @@ static inline void put_page(struct page *page)
 	 * need to inform the device driver through callback. See
 	 * include/linux/memremap.h and HMM for details.
 	 */
-	if (put_devmap_managed_page(page))
+	if (page_is_devmap_managed(page)) {
+		put_devmap_managed_page(page);
 		return;
+	}
 
 	if (put_page_testzero(page))
 		__put_page(page);
 }
 
 /**
- * put_user_page() - release a gup-pinned page
+ * unpin_user_page() - release a gup-pinned page
  * @page: pointer to page to be released
  *
- * Pages that were pinned via get_user_pages*() must be released via
- * either put_user_page(), or one of the put_user_pages*() routines
- * below. This is so that eventually, pages that are pinned via
- * get_user_pages*() can be separately tracked and uniquely handled. In
- * particular, interactions with RDMA and filesystems need special
- * handling.
+ * Pages that were pinned via pin_user_pages*() must be released via either
+ * unpin_user_page(), or one of the unpin_user_pages*() routines. This is so
+ * that eventually such pages can be separately tracked and uniquely handled. In
+ * particular, interactions with RDMA and filesystems need special handling.
  *
- * put_user_page() and put_page() are not interchangeable, despite this early
- * implementation that makes them look the same. put_user_page() calls must
- * be perfectly matched up with get_user_page() calls.
+ * unpin_user_page() and put_page() are not interchangeable, despite this early
+ * implementation that makes them look the same. unpin_user_page() calls must
+ * be perfectly matched up with pin*() calls.
  */
-static inline void put_user_page(struct page *page)
+static inline void unpin_user_page(struct page *page)
 {
 	put_page(page);
 }
 
-void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
-			       bool make_dirty);
+void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
+				 bool make_dirty);
 
-void put_user_pages(struct page **pages, unsigned long npages);
+void unpin_user_pages(struct page **pages, unsigned long npages);
 
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
@@ -1501,9 +1498,16 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
 			    unsigned long start, unsigned long nr_pages,
 			    unsigned int gup_flags, struct page **pages,
 			    struct vm_area_struct **vmas, int *locked);
+long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
+			   unsigned long start, unsigned long nr_pages,
+			   unsigned int gup_flags, struct page **pages,
+			   struct vm_area_struct **vmas, int *locked);
 long get_user_pages(unsigned long start, unsigned long nr_pages,
 			    unsigned int gup_flags, struct page **pages,
 			    struct vm_area_struct **vmas);
+long pin_user_pages(unsigned long start, unsigned long nr_pages,
+		    unsigned int gup_flags, struct page **pages,
+		    struct vm_area_struct **vmas);
 long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
 		    unsigned int gup_flags, struct page **pages, int *locked);
 long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
@@ -1511,6 +1515,8 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 
 int get_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages);
+int pin_user_pages_fast(unsigned long start, int nr_pages,
+			unsigned int gup_flags, struct page **pages);
 
 int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
 int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
@@ -2575,13 +2581,15 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_ANON 0x8000 /* don't do file mappings */
 #define FOLL_LONGTERM 0x10000 /* mapping lifetime is indefinite: see below */
 #define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
+#define FOLL_PIN 0x40000 /* pages must be released via unpin_user_page */
 
 /*
- * NOTE on FOLL_LONGTERM:
+ * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
+ * other. Here is what they mean, and how to use them:
  *
  * FOLL_LONGTERM indicates that the page will be held for an indefinite time
- * period _often_ under userspace control. This is contrasted with
- * iov_iter_get_pages() where usages which are transient.
+ * period _often_ under userspace control. This is in contrast to
+ * iov_iter_get_pages(), whose usages are transient.
  *
  * FIXME: For pages which are part of a filesystem, mappings are subject to the
  * lifetime enforced by the filesystem and we need guarantees that longterm
@@ -2596,11 +2604,39 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
  * Currently only get_user_pages() and get_user_pages_fast() support this flag
  * and calls to get_user_pages_[un]locked are specifically not allowed. This
  * is due to an incompatibility with the FS DAX check and
- * FAULT_FLAG_ALLOW_RETRY
+ * FAULT_FLAG_ALLOW_RETRY.
  *
- * In the CMA case: longterm pins in a CMA region would unnecessarily fragment
- * that region. And so CMA attempts to migrate the page before pinning when
+ * In the CMA case: long term pins in a CMA region would unnecessarily fragment
+ * that region. And so, CMA attempts to migrate the page before pinning, when
  * FOLL_LONGTERM is specified.
+ *
+ * FOLL_PIN indicates that a special kind of tracking (not just page->_refcount,
+ * but an additional pin counting system) will be invoked. This is intended for
+ * anything that gets a page reference and then touches page data (for example,
+ * Direct IO). This lets the filesystem know that some non-file-system entity is
+ * potentially changing the pages' data. In contrast to FOLL_GET (whose pages
+ * are released via put_page()), FOLL_PIN pages must be released, ultimately, by
+ * a call to unpin_user_page().
+ *
+ * FOLL_PIN is similar to FOLL_GET: both of these pin pages. They use different
+ * and separate refcounting mechanisms, however, and that means that each has
+ * its own acquire and release mechanisms:
+ *
+ * FOLL_GET: get_user_pages*() to acquire, and put_page() to release.
+ *
+ * FOLL_PIN: pin_user_pages*() to acquire, and unpin_user_pages to release.
+ *
+ * FOLL_PIN and FOLL_GET are mutually exclusive for a given function call.
+ * (The underlying pages may experience both FOLL_GET-based and FOLL_PIN-based
+ * calls applied to them, and that's perfectly OK. This is a constraint on the
+ * callers, not on the pages.)
+ *
+ * FOLL_PIN should be set internally by the pin_user_pages*() APIs, never
+ * directly by the caller. That's in order to help avoid mismatches when
+ * releasing pages: get_user_pages*() pages must be released via put_page(),
+ * while pin_user_pages*() pages must be released via unpin_user_page().
+ *
+ * Please see Documentation/vm/pin_user_pages.rst for more information.
  */
 
 static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5334ad8fc7bd..c2bc309d1634 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -758,7 +758,7 @@ typedef struct pglist_data {
 
 #ifdef CONFIG_NUMA
 	/*
-	 * zone reclaim becomes active if more unmapped pages exist.
+	 * node reclaim becomes active if more unmapped pages exist.
 	 */
 	unsigned long min_unmapped_pages;
 	unsigned long min_slab_pages;
diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index 6861df759fad..572458016331 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -33,8 +33,8 @@ static inline bool is_migrate_isolate(int migratetype)
 #define MEMORY_OFFLINE 0x1
 #define REPORT_FAILURE 0x2
 
-bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
-			 int migratetype, int flags);
+struct page *has_unmovable_pages(struct zone *zone, struct page *page,
+				 int migratetype, int flags);
 void set_pageblock_migratetype(struct page *page, int migratetype);
 int move_freepages_block(struct zone *zone, struct page *page,
 				int migratetype, int *num_movable);
diff --git a/include/linux/swab.h b/include/linux/swab.h
index e466fd159c85..bcff5149861a 100644
--- a/include/linux/swab.h
+++ b/include/linux/swab.h
@@ -7,6 +7,7 @@
 # define swab16 __swab16
 # define swab32 __swab32
 # define swab64 __swab64
+# define swab __swab
 # define swahw32 __swahw32
 # define swahb32 __swahb32
 # define swab16p __swab16p
diff --git a/include/linux/thermal.h b/include/linux/thermal.h
index d9111aebb97d..126913c6a53b 100644
--- a/include/linux/thermal.h
+++ b/include/linux/thermal.h
@@ -32,17 +32,6 @@
 /* use value, which < 0K, to indicate an invalid/uninitialized temperature */
 #define THERMAL_TEMP_INVALID -274000
 
-/* Unit conversion macros */
-#define DECI_KELVIN_TO_CELSIUS(t) ({ \
-	long _t = (t); \
-	((_t-2732 >= 0) ? (_t-2732+5)/10 : (_t-2732-5)/10); \
-})
-#define CELSIUS_TO_DECI_KELVIN(t) ((t)*10+2732)
-#define DECI_KELVIN_TO_MILLICELSIUS_WITH_OFFSET(t, off) (((t) - (off)) * 100)
-#define DECI_KELVIN_TO_MILLICELSIUS(t) DECI_KELVIN_TO_MILLICELSIUS_WITH_OFFSET(t, 2732)
-#define MILLICELSIUS_TO_DECI_KELVIN_WITH_OFFSET(t, off) (((t) / 100) + (off))
-#define MILLICELSIUS_TO_DECI_KELVIN(t) MILLICELSIUS_TO_DECI_KELVIN_WITH_OFFSET(t, 2732)
-
 /* Default Thermal Governor */
 #if defined(CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE)
 #define DEFAULT_THERMAL_GOVERNOR "step_wise"
diff --git a/include/linux/units.h b/include/linux/units.h
new file mode 100644
index 000000000000..aaf716364ec3
--- /dev/null
+++ b/include/linux/units.h
@@ -0,0 +1,84 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNITS_H
+#define _LINUX_UNITS_H
+
+#include <linux/kernel.h>
+
+#define ABSOLUTE_ZERO_MILLICELSIUS -273150
+
+static inline long milli_kelvin_to_millicelsius(long t)
+{
+	return t + ABSOLUTE_ZERO_MILLICELSIUS;
+}
+
+static inline long millicelsius_to_milli_kelvin(long t)
+{
+	return t - ABSOLUTE_ZERO_MILLICELSIUS;
+}
+
+#define MILLIDEGREE_PER_DEGREE 1000
+#define MILLIDEGREE_PER_DECIDEGREE 100
+
+static inline long kelvin_to_millicelsius(long t)
+{
+	return milli_kelvin_to_millicelsius(t * MILLIDEGREE_PER_DEGREE);
+}
+
+static inline long millicelsius_to_kelvin(long t)
+{
+	t = millicelsius_to_milli_kelvin(t);
+
+	return DIV_ROUND_CLOSEST(t, MILLIDEGREE_PER_DEGREE);
+}
+
+static inline long deci_kelvin_to_celsius(long t)
+{
+	t = milli_kelvin_to_millicelsius(t * MILLIDEGREE_PER_DECIDEGREE);
+
+	return DIV_ROUND_CLOSEST(t, MILLIDEGREE_PER_DEGREE);
+}
+
+static inline long celsius_to_deci_kelvin(long t)
+{
+	t = millicelsius_to_milli_kelvin(t * MILLIDEGREE_PER_DEGREE);
+
+	return DIV_ROUND_CLOSEST(t, MILLIDEGREE_PER_DECIDEGREE);
+}
+
+/**
+ * deci_kelvin_to_millicelsius_with_offset - convert Kelvin to Celsius
+ * @t: temperature value in decidegrees Kelvin
+ * @offset: difference between Kelvin and Celsius in millidegrees
+ *
+ * Return: temperature value in millidegrees Celsius
+ */
+static inline long deci_kelvin_to_millicelsius_with_offset(long t, long offset)
+{
+	return t * MILLIDEGREE_PER_DECIDEGREE - offset;
+}
+
+static inline long deci_kelvin_to_millicelsius(long t)
+{
+	return milli_kelvin_to_millicelsius(t * MILLIDEGREE_PER_DECIDEGREE);
+}
+
+static inline long millicelsius_to_deci_kelvin(long t)
+{
+	t = millicelsius_to_milli_kelvin(t);
+
+	return DIV_ROUND_CLOSEST(t, MILLIDEGREE_PER_DECIDEGREE);
+}
+
+static inline long kelvin_to_celsius(long t)
+{
+	return t + DIV_ROUND_CLOSEST(ABSOLUTE_ZERO_MILLICELSIUS,
+				     MILLIDEGREE_PER_DEGREE);
+}
+
+static inline long celsius_to_kelvin(long t)
+{
+	return t - DIV_ROUND_CLOSEST(ABSOLUTE_ZERO_MILLICELSIUS,
+				     MILLIDEGREE_PER_DEGREE);
+}
+
+#endif /* _LINUX_UNITS_H */
diff --git a/include/linux/zlib.h b/include/linux/zlib.h
index 92dbbd3f6c75..c757d848a758 100644
--- a/include/linux/zlib.h
+++ b/include/linux/zlib.h
@@ -191,6 +191,12 @@ extern int zlib_deflate_workspacesize (int windowBits, int memLevel);
    exceed those passed here.
 */
 
+extern int zlib_deflate_dfltcc_enabled (void);
+/*
+   Returns 1 if Deflate-Conversion facility is installed and enabled,
+   otherwise 0.
+*/
+
 /*
 extern int deflateInit (z_streamp strm, int level);
 
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index ad7e642bd497..f65b1f6db22d 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -88,8 +88,8 @@ DECLARE_EVENT_CLASS(kmem_alloc_node,
 		__entry->node = node;
 	),
 
-	TP_printk("call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d",
-		__entry->call_site,
+	TP_printk("call_site=%pS ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d",
+		(void *)__entry->call_site,
 		__entry->ptr,
 		__entry->bytes_req,
 		__entry->bytes_alloc,
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index ef50be4e5e6c..d94def25e4dc 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -67,8 +67,8 @@ DECLARE_EVENT_CLASS(writeback_page_template,
 
 	TP_fast_assign(
 		strscpy_pad(__entry->name,
-			    mapping ? dev_name(inode_to_bdi(mapping->host)->dev) : "(unknown)",
-			    32);
+			    bdi_dev_name(mapping ? inode_to_bdi(mapping->host) :
+					 NULL), 32);
 		__entry->ino = mapping ? mapping->host->i_ino : 0;
 		__entry->index = page->index;
 	),
@@ -111,8 +111,7 @@ DECLARE_EVENT_CLASS(writeback_dirty_inode_template,
 		struct backing_dev_info *bdi = inode_to_bdi(inode);
 
 		/* may be called for files on pseudo FSes w/ unregistered bdi */
-		strscpy_pad(__entry->name,
-			    bdi->dev ? dev_name(bdi->dev) : "(unknown)", 32);
+		strscpy_pad(__entry->name, bdi_dev_name(bdi), 32);
 		__entry->ino = inode->i_ino;
 		__entry->state = inode->i_state;
 		__entry->flags = flags;
@@ -193,7 +192,7 @@ TRACE_EVENT(inode_foreign_history,
 	),
 
 	TP_fast_assign(
-		strncpy(__entry->name, dev_name(inode_to_bdi(inode)->dev), 32);
+		strncpy(__entry->name, bdi_dev_name(inode_to_bdi(inode)), 32);
 		__entry->ino = inode->i_ino;
 		__entry->cgroup_ino = __trace_wbc_assign_cgroup(wbc);
 		__entry->history = history;
@@ -222,7 +221,7 @@ TRACE_EVENT(inode_switch_wbs,
 	),
 
 	TP_fast_assign(
-		strncpy(__entry->name, dev_name(old_wb->bdi->dev), 32);
+		strncpy(__entry->name, bdi_dev_name(old_wb->bdi), 32);
 		__entry->ino = inode->i_ino;
 		__entry->old_cgroup_ino = __trace_wb_assign_cgroup(old_wb);
 		__entry->new_cgroup_ino = __trace_wb_assign_cgroup(new_wb);
@@ -255,7 +254,7 @@ TRACE_EVENT(track_foreign_dirty,
 		struct address_space *mapping = page_mapping(page);
 		struct inode *inode = mapping ? mapping->host : NULL;
 
-		strncpy(__entry->name, dev_name(wb->bdi->dev), 32);
+		strncpy(__entry->name, bdi_dev_name(wb->bdi), 32);
 		__entry->bdi_id = wb->bdi->id;
 		__entry->ino = inode ? inode->i_ino : 0;
 		__entry->memcg_id = wb->memcg_css->id;
@@ -288,7 +287,7 @@ TRACE_EVENT(flush_foreign,
 	),
 
 	TP_fast_assign(
-		strncpy(__entry->name, dev_name(wb->bdi->dev), 32);
+		strncpy(__entry->name, bdi_dev_name(wb->bdi), 32);
 		__entry->cgroup_ino = __trace_wb_assign_cgroup(wb);
 		__entry->frn_bdi_id = frn_bdi_id;
 		__entry->frn_memcg_id = frn_memcg_id;
@@ -318,7 +317,7 @@ DECLARE_EVENT_CLASS(writeback_write_inode_template,
 
 	TP_fast_assign(
 		strscpy_pad(__entry->name,
-			    dev_name(inode_to_bdi(inode)->dev), 32);
+			    bdi_dev_name(inode_to_bdi(inode)), 32);
 		__entry->ino = inode->i_ino;
 		__entry->sync_mode = wbc->sync_mode;
 		__entry->cgroup_ino = __trace_wbc_assign_cgroup(wbc);
@@ -361,9 +360,7 @@ DECLARE_EVENT_CLASS(writeback_work_class,
 		__field(ino_t, cgroup_ino)
 	),
 	TP_fast_assign(
-		strscpy_pad(__entry->name,
-			    wb->bdi->dev ? dev_name(wb->bdi->dev) :
-			    "(unknown)", 32);
+		strscpy_pad(__entry->name, bdi_dev_name(wb->bdi), 32);
 		__entry->nr_pages = work->nr_pages;
 		__entry->sb_dev = work->sb ? work->sb->s_dev : 0;
 		__entry->sync_mode = work->sync_mode;
@@ -416,7 +413,7 @@ DECLARE_EVENT_CLASS(writeback_class,
 		__field(ino_t, cgroup_ino)
 	),
 	TP_fast_assign(
-		strscpy_pad(__entry->name, dev_name(wb->bdi->dev), 32);
+		strscpy_pad(__entry->name, bdi_dev_name(wb->bdi), 32);
 		__entry->cgroup_ino = __trace_wb_assign_cgroup(wb);
 	),
 	TP_printk("bdi %s: cgroup_ino=%lu",
@@ -438,7 +435,7 @@ TRACE_EVENT(writeback_bdi_register,
 		__array(char, name, 32)
 	),
 	TP_fast_assign(
-		strscpy_pad(__entry->name, dev_name(bdi->dev), 32);
+		strscpy_pad(__entry->name, bdi_dev_name(bdi), 32);
 	),
 	TP_printk("bdi %s",
 		__entry->name
@@ -463,7 +460,7 @@ DECLARE_EVENT_CLASS(wbc_class,
 	),
 
 	TP_fast_assign(
-		strscpy_pad(__entry->name, dev_name(bdi->dev), 32);
+		strscpy_pad(__entry->name, bdi_dev_name(bdi), 32);
 		__entry->nr_to_write = wbc->nr_to_write;
 		__entry->pages_skipped = wbc->pages_skipped;
 		__entry->sync_mode = wbc->sync_mode;
@@ -514,7 +511,7 @@ TRACE_EVENT(writeback_queue_io,
 	),
 	TP_fast_assign(
 		unsigned long *older_than_this = work->older_than_this;
-		strscpy_pad(__entry->name, dev_name(wb->bdi->dev), 32);
+		strscpy_pad(__entry->name, bdi_dev_name(wb->bdi), 32);
 		__entry->older = older_than_this ? *older_than_this : 0;
 		__entry->age = older_than_this ?
 				  (jiffies - *older_than_this) * 1000 / HZ : -1;
@@ -600,7 +597,7 @@ TRACE_EVENT(bdi_dirty_ratelimit,
 	),
 
 	TP_fast_assign(
-		strscpy_pad(__entry->bdi, dev_name(wb->bdi->dev), 32);
+		strscpy_pad(__entry->bdi, bdi_dev_name(wb->bdi), 32);
 		__entry->write_bw = KBps(wb->write_bandwidth);
 		__entry->avg_write_bw = KBps(wb->avg_write_bandwidth);
 		__entry->dirty_rate = KBps(dirty_rate);
@@ -665,7 +662,7 @@ TRACE_EVENT(balance_dirty_pages,
 
 	TP_fast_assign(
 		unsigned long freerun = (thresh + bg_thresh) / 2;
-		strscpy_pad(__entry->bdi, dev_name(wb->bdi->dev), 32);
+		strscpy_pad(__entry->bdi, bdi_dev_name(wb->bdi), 32);
 
 		__entry->limit = global_wb_domain.dirty_limit;
 		__entry->setpoint = (global_wb_domain.dirty_limit +
@@ -726,7 +723,7 @@ TRACE_EVENT(writeback_sb_inodes_requeue,
 
 	TP_fast_assign(
 		strscpy_pad(__entry->name,
-			    dev_name(inode_to_bdi(inode)->dev), 32);
+			    bdi_dev_name(inode_to_bdi(inode)), 32);
 		__entry->ino = inode->i_ino;
 		__entry->state = inode->i_state;
 		__entry->dirtied_when = inode->dirtied_when;
@@ -800,7 +797,7 @@ DECLARE_EVENT_CLASS(writeback_single_inode_template,
 
 	TP_fast_assign(
 		strscpy_pad(__entry->name,
-			    dev_name(inode_to_bdi(inode)->dev), 32);
+			    bdi_dev_name(inode_to_bdi(inode)), 32);
 		__entry->ino = inode->i_ino;
 		__entry->state = inode->i_state;
 		__entry->dirtied_when = inode->dirtied_when;
diff --git a/include/uapi/linux/swab.h b/include/uapi/linux/swab.h
index 23cd84868cc3..fa7f97da5b76 100644
--- a/include/uapi/linux/swab.h
+++ b/include/uapi/linux/swab.h
@@ -4,6 +4,7 @@
 
 #include <linux/types.h>
 #include <linux/compiler.h>
+#include <asm/bitsperlong.h>
 #include <asm/swab.h>
 
 /*
@@ -132,6 +133,15 @@ static inline __attribute_const__ __u32 __fswahb32(__u32 val)
 	__fswab64(x))
 #endif
 
+static __always_inline unsigned long __swab(const unsigned long y)
+{
+#if BITS_PER_LONG == 64
+	return __swab64(y);
+#else /* BITS_PER_LONG == 32 */
+	return __swab32(y);
+#endif
+}
+
 /**
  * __swahw32 - return a word-swapped 32-bit value
  * @x: value to wordswap
diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h
index 87aa2a6d9125..27c1ed2822e6 100644
--- a/include/uapi/linux/sysctl.h
+++ b/include/uapi/linux/sysctl.h
@@ -195,7 +195,7 @@ enum
 	VM_MIN_UNMAPPED=32, /* Set min percent of unmapped pages */
 	VM_PANIC_ON_OOM=33, /* panic at out-of-memory */
 	VM_VDSO_ENABLED=34, /* map VDSO into new processes? */
-	VM_MIN_SLAB=35, /* Percent pages ignored by zone reclaim */
+	VM_MIN_SLAB=35, /* Percent pages ignored by node reclaim */
 };
 
 
diff --git a/init/main.c b/init/main.c
index db13a76c036e..d8c7e86c2d28 100644
--- a/init/main.c
+++ b/init/main.c
@@ -246,8 +246,7 @@ static int __init loglevel(char *str)
 early_param("loglevel", loglevel);
 
 /* Change NUL term back to "=", to make "param" the whole string. */
-static int __init repair_env_string(char *param, char *val,
-				    const char *unused, void *arg)
+static void __init repair_env_string(char *param, char *val)
 {
 	if (val) {
 		/* param=val or param="val"? */
@@ -256,11 +255,9 @@ static int __init repair_env_string(char *param, char *val,
 		else if (val == param+strlen(param)+2) {
 			val[-2] = '=';
 			memmove(val-1, val, strlen(val)+1);
-			val--;
 		} else
 			BUG();
 	}
-	return 0;
 }
 
 /* Anything after -- gets handed straight to init. */
@@ -272,7 +269,7 @@ static int __init set_init_arg(char *param, char *val,
 	if (panic_later)
 		return 0;
 
-	repair_env_string(param, val, unused, NULL);
+	repair_env_string(param, val);
 
 	for (i = 0; argv_init[i]; i++) {
 		if (i == MAX_INIT_ARGS) {
@@ -292,14 +289,16 @@ static int __init set_init_arg(char *param, char *val,
 static int __init unknown_bootoption(char *param, char *val,
 				     const char *unused, void *arg)
 {
-	repair_env_string(param, val, unused, NULL);
+	size_t len = strlen(param);
+
+	repair_env_string(param, val);
 
 	/* Handle obsolete-style parameters */
 	if (obsolete_checksetup(param))
 		return 0;
 
 	/* Unused module parameter. */
-	if (strchr(param, '.') && (!val || strchr(param, '.') < val))
+	if (strnchr(param, len, '.'))
 		return 0;
 
 	if (panic_later)
@@ -313,7 +312,7 @@ static int __init unknown_bootoption(char *param, char *val,
 				panic_later = "env";
 				panic_param = param;
 			}
-			if (!strncmp(param, envp_init[i], val - param))
+			if (!strncmp(param, envp_init[i], len+1))
 				break;
 		}
 		envp_init[i] = param;
@@ -991,6 +990,12 @@ static const char *initcall_level_names[] __initdata = {
 	"late",
 };
 
+static int __init ignore_unknown_bootoption(char *param, char *val,
+			       const char *unused, void *arg)
+{
+	return 0;
+}
+
 static void __init do_initcall_level(int level)
 {
 	initcall_entry_t *fn;
@@ -1000,7 +1005,7 @@ static void __init do_initcall_level(int level)
 		   initcall_command_line, __start___param,
 		   __stop___param - __start___param,
 		   level, level,
-		   NULL, &repair_env_string);
+		   NULL, ignore_unknown_bootoption);
 
 	trace_initcall_level(initcall_level_names[level]);
 	for (fn = initcall_levels[level]; fn < initcall_levels[level+1]; fn++)
@@ -1043,8 +1048,16 @@ static void __init do_pre_smp_initcalls(void)
 
 static int run_init_process(const char *init_filename)
 {
+	const char *const *p;
+
 	argv_init[0] = init_filename;
 	pr_info("Run %s as init process\n", init_filename);
+	pr_debug(" with arguments:\n");
+	for (p = argv_init; *p; p++)
+		pr_debug(" %s\n", *p);
+	pr_debug(" with environment:\n");
+	for (p = envp_init; *p; p++)
+		pr_debug(" %s\n", *p);
 	return do_execve(getname_kernel(init_filename),
 		(const char __user *const __user *)argv_init,
 		(const char __user *const __user *)envp_init);
@@ -1091,6 +1104,11 @@ static void mark_readonly(void)
 	} else
 		pr_info("Kernel memory protection disabled.\n");
 }
+#elif defined(CONFIG_ARCH_HAS_STRICT_KERNEL_RWX)
+static inline void mark_readonly(void)
+{
+	pr_warn("Kernel memory protection not selected by kernel config.\n");
+}
 #else
 static inline void mark_readonly(void)
 {
diff --git a/kernel/Makefile b/kernel/Makefile
index f2cc0d118a0b..4cb4130ced32 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -27,6 +27,7 @@ KCOV_INSTRUMENT_softirq.o := n
 # and produce insane amounts of uninteresting coverage.
 KCOV_INSTRUMENT_module.o := n
 KCOV_INSTRUMENT_extable.o := n
+KCOV_INSTRUMENT_stacktrace.o := n
 # Don't self-instrument.
 KCOV_INSTRUMENT_kcov.o := n
 KASAN_SANITIZE_kcov.o := n
diff --git a/lib/Kconfig b/lib/Kconfig
index 6e790dc55c5b..bc7e56370129 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -278,6 +278,13 @@ config ZLIB_DEFLATE
 	tristate
 	select BITREVERSE
 
+config ZLIB_DFLTCC
+	def_bool y
+	depends on S390
+	prompt "Enable s390x DEFLATE CONVERSION CALL support for kernel zlib"
+	help
+	 Enable s390x hardware support for zlib in the kernel.
+
 config LZO_COMPRESS
 	tristate
 
diff --git a/lib/Makefile b/lib/Makefile
index c20b1debe9b4..23ca78d43d24 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -16,6 +16,7 @@ KCOV_INSTRUMENT_rbtree.o := n
 KCOV_INSTRUMENT_list_debug.o := n
 KCOV_INSTRUMENT_debugobjects.o := n
 KCOV_INSTRUMENT_dynamic_debug.o := n
+KCOV_INSTRUMENT_fault-inject.o := n
 
 # Early boot use of cmdline, don't instrument it
 ifdef CONFIG_AMD_MEM_ENCRYPT
@@ -140,6 +141,7 @@ obj-$(CONFIG_842_COMPRESS) += 842/
 obj-$(CONFIG_842_DECOMPRESS) += 842/
 obj-$(CONFIG_ZLIB_INFLATE) += zlib_inflate/
 obj-$(CONFIG_ZLIB_DEFLATE) += zlib_deflate/
+obj-$(CONFIG_ZLIB_DFLTCC) += zlib_dfltcc/
 obj-$(CONFIG_REED_SOLOMON) += reed_solomon/
 obj-$(CONFIG_BCH) += bch.o
 obj-$(CONFIG_LZO_COMPRESS) += lzo/
diff --git a/lib/decompress_inflate.c b/lib/decompress_inflate.c
index 63b4b7eee138..6130c42b8e59 100644
--- a/lib/decompress_inflate.c
+++ b/lib/decompress_inflate.c
@@ -10,6 +10,10 @@
 #include "zlib_inflate/inftrees.c"
 #include "zlib_inflate/inffast.c"
 #include "zlib_inflate/inflate.c"
+#ifdef CONFIG_ZLIB_DFLTCC
+#include "zlib_dfltcc/dfltcc.c"
+#include "zlib_dfltcc/dfltcc_inflate.c"
+#endif
 
 #else /* STATIC */
 /* initramfs et al: linked */
@@ -76,7 +80,12 @@ STATIC int INIT __gunzip(unsigned char *buf, long len,
 	}
 
 	strm->workspace = malloc(flush ? zlib_inflate_workspacesize() :
+#ifdef CONFIG_ZLIB_DFLTCC
+	/* Always allocate the full workspace for DFLTCC */
+				 zlib_inflate_workspacesize());
+#else
 				 sizeof(struct inflate_state));
+#endif
 	if (strm->workspace == NULL) {
 		error("Out of memory while allocating workspace");
 		goto gunzip_nomem4;
@@ -123,10 +132,14 @@ STATIC int INIT __gunzip(unsigned char *buf, long len,
 
 	rc = zlib_inflateInit2(strm, -MAX_WBITS);
 
+#ifdef CONFIG_ZLIB_DFLTCC
+	/* Always keep the window for DFLTCC */
+#else
 	if (!flush) {
 		WS(strm)->inflate_state.wsize = 0;
 		WS(strm)->inflate_state.window = NULL;
 	}
+#endif
 
 	while (rc == Z_OK) {
 		if (strm->avail_in == 0) {
diff --git a/lib/find_bit.c b/lib/find_bit.c
index e35a76b291e6..49f875f1baf7 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -17,9 +17,9 @@
 #include <linux/export.h>
 #include <linux/kernel.h>
 
-#if !defined(find_next_bit) || !defined(find_next_zero_bit) || \
-    !defined(find_next_and_bit)
-
+#if !defined(find_next_bit) || !defined(find_next_zero_bit) || \
+	!defined(find_next_bit_le) || !defined(find_next_zero_bit_le) || \
+	!defined(find_next_and_bit)
 /*
  * This is a common helper function for find_next_bit, find_next_zero_bit, and
  * find_next_and_bit. The differences are:
@@ -27,11 +27,11 @@
  * searching it for one bits.
  * - The optional "addr2", which is anded with "addr1" if present.
  */
-static inline unsigned long _find_next_bit(const unsigned long *addr1,
+static unsigned long _find_next_bit(const unsigned long *addr1,
 		const unsigned long *addr2, unsigned long nbits,
-		unsigned long start, unsigned long invert)
+		unsigned long start, unsigned long invert, unsigned long le)
 {
-	unsigned long tmp;
+	unsigned long tmp, mask;
 
 	if (unlikely(start >= nbits))
 		return nbits;
@@ -42,7 +42,12 @@ static inline unsigned long _find_next_bit(const unsigned long *addr1,
 	tmp ^= invert;
 
 	/* Handle 1st word. */
-	tmp &= BITMAP_FIRST_WORD_MASK(start);
+	mask = BITMAP_FIRST_WORD_MASK(start);
+	if (le)
+		mask = swab(mask);
+
+	tmp &= mask;
+
 	start = round_down(start, BITS_PER_LONG);
 
 	while (!tmp) {
@@ -56,6 +61,9 @@ static inline unsigned long _find_next_bit(const unsigned long *addr1,
 		tmp ^= invert;
 	}
 
+	if (le)
+		tmp = swab(tmp);
+
 	return min(start + __ffs(tmp), nbits);
 }
 #endif
@@ -67,7 +75,7 @@ static inline unsigned long _find_next_bit(const unsigned long *addr1,
 unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
 			    unsigned long offset)
 {
-	return _find_next_bit(addr, NULL, size, offset, 0UL);
+	return _find_next_bit(addr, NULL, size, offset, 0UL, 0);
 }
 EXPORT_SYMBOL(find_next_bit);
 #endif
@@ -76,7 +84,7 @@ EXPORT_SYMBOL(find_next_bit);
 unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
 				 unsigned long offset)
 {
-	return _find_next_bit(addr, NULL, size, offset, ~0UL);
+	return _find_next_bit(addr, NULL, size, offset, ~0UL, 0);
 }
 EXPORT_SYMBOL(find_next_zero_bit);
 #endif
@@ -86,7 +94,7 @@ unsigned long find_next_and_bit(const unsigned long *addr1,
 		const unsigned long *addr2, unsigned long size,
 		unsigned long offset)
 {
-	return _find_next_bit(addr1, addr2, size, offset, 0UL);
+	return _find_next_bit(addr1, addr2, size, offset, 0UL, 0);
 }
 EXPORT_SYMBOL(find_next_and_bit);
 #endif
@@ -149,57 +157,11 @@ EXPORT_SYMBOL(find_last_bit);
 
 #ifdef __BIG_ENDIAN
 
-/* include/linux/byteorder does not support "unsigned long" type */
-static inline unsigned long ext2_swab(const unsigned long y)
-{
-#if BITS_PER_LONG == 64
-	return (unsigned long) __swab64((u64) y);
-#elif BITS_PER_LONG == 32
-	return (unsigned long) __swab32((u32) y);
-#else
-#error BITS_PER_LONG not defined
-#endif
-}
-
-#if !defined(find_next_bit_le) || !defined(find_next_zero_bit_le)
-static inline unsigned long _find_next_bit_le(const unsigned long *addr1,
-		const unsigned long *addr2, unsigned long nbits,
-		unsigned long start, unsigned long invert)
-{
-	unsigned long tmp;
-
-	if (unlikely(start >= nbits))
-		return nbits;
-
-	tmp = addr1[start / BITS_PER_LONG];
-	if (addr2)
-		tmp &= addr2[start / BITS_PER_LONG];
-	tmp ^= invert;
-
-	/* Handle 1st word. */
-	tmp &= ext2_swab(BITMAP_FIRST_WORD_MASK(start));
-	start = round_down(start, BITS_PER_LONG);
-
-	while (!tmp) {
-		start += BITS_PER_LONG;
-		if (start >= nbits)
-			return nbits;
-
-		tmp = addr1[start / BITS_PER_LONG];
-		if (addr2)
-			tmp &= addr2[start / BITS_PER_LONG];
-		tmp ^= invert;
-	}
-
-	return min(start + __ffs(ext2_swab(tmp)), nbits);
-}
-#endif
-
 #ifndef find_next_zero_bit_le
 unsigned long find_next_zero_bit_le(const void *addr, unsigned
 		long size, unsigned long offset)
 {
-	return _find_next_bit_le(addr, NULL, size, offset, ~0UL);
+	return _find_next_bit(addr, NULL, size, offset, ~0UL, 1);
 }
 EXPORT_SYMBOL(find_next_zero_bit_le);
 #endif
@@ -208,7 +170,7 @@ EXPORT_SYMBOL(find_next_zero_bit_le);
 unsigned long find_next_bit_le(const void *addr, unsigned
 		long size, unsigned long offset)
 {
-	return _find_next_bit_le(addr, NULL, size, offset, 0UL);
+	return _find_next_bit(addr, NULL, size, offset, 0UL, 1);
 }
 EXPORT_SYMBOL(find_next_bit_le);
 #endif
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index c2cf2c311b7d..5813072bc589 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -311,7 +311,7 @@ int __sg_alloc_table(struct sg_table *table, unsigned int nents,
 			if (prv)
 				table->nents = ++table->orig_nents;
 
-			return -ENOMEM;
+			return -ENOMEM;
 		}
 
 		sg_init_table(sg, alloc_size);
diff --git a/lib/test_bitmap.c b/lib/test_bitmap.c
index e14a15ac250b..71ec3afe1681 100644
--- a/lib/test_bitmap.c
+++ b/lib/test_bitmap.c
@@ -275,22 +275,23 @@ static void __init test_copy(void)
 static void __init test_replace(void)
 {
 	unsigned int nbits = 64;
+	unsigned int nlongs = DIV_ROUND_UP(nbits, BITS_PER_LONG);
 	DECLARE_BITMAP(bmap, 1024);
 
 	bitmap_zero(bmap, 1024);
-	bitmap_replace(bmap, &exp2[0], &exp2[1], exp2_to_exp3_mask, nbits);
+	bitmap_replace(bmap, &exp2[0 * nlongs], &exp2[1 * nlongs], exp2_to_exp3_mask, nbits);
 	expect_eq_bitmap(bmap, exp3_0_1, nbits);
 
 	bitmap_zero(bmap, 1024);
-	bitmap_replace(bmap, &exp2[1], &exp2[0], exp2_to_exp3_mask, nbits);
+	bitmap_replace(bmap, &exp2[1 * nlongs], &exp2[0 * nlongs], exp2_to_exp3_mask, nbits);
 	expect_eq_bitmap(bmap, exp3_1_0, nbits);
 
 	bitmap_fill(bmap, 1024);
-	bitmap_replace(bmap, &exp2[0], &exp2[1], exp2_to_exp3_mask, nbits);
+	bitmap_replace(bmap, &exp2[0 * nlongs], &exp2[1 * nlongs], exp2_to_exp3_mask, nbits);
 	expect_eq_bitmap(bmap, exp3_0_1, nbits);
 
 	bitmap_fill(bmap, 1024);
-	bitmap_replace(bmap, &exp2[1], &exp2[0], exp2_to_exp3_mask, nbits);
+	bitmap_replace(bmap, &exp2[1 * nlongs], &exp2[0 * nlongs], exp2_to_exp3_mask, nbits);
 	expect_eq_bitmap(bmap, exp3_1_0, nbits);
 }
 
diff --git a/lib/test_kasan.c b/lib/test_kasan.c
index 328d33beae36..3872d250ed2c 100644
--- a/lib/test_kasan.c
+++ b/lib/test_kasan.c
@@ -158,6 +158,7 @@ static noinline void __init kmalloc_oob_krealloc_more(void)
 	if (!ptr1 || !ptr2) {
 		pr_err("Allocation failed\n");
 		kfree(ptr1);
+		kfree(ptr2);
 		return;
 	}
 
diff --git a/lib/zlib_deflate/deflate.c b/lib/zlib_deflate/deflate.c
index d20ef458f137..8a878d0d892c 100644
--- a/lib/zlib_deflate/deflate.c
+++ b/lib/zlib_deflate/deflate.c
@@ -52,16 +52,19 @@
 #include <linux/zutil.h>
 #include "defutil.h"
 
+/* architecture-specific bits */
+#ifdef CONFIG_ZLIB_DFLTCC
+# include "../zlib_dfltcc/dfltcc.h"
+#else
+#define DEFLATE_RESET_HOOK(strm) do {} while (0)
+#define DEFLATE_HOOK(strm, flush, bstate) 0
+#define DEFLATE_NEED_CHECKSUM(strm) 1
+#define DEFLATE_DFLTCC_ENABLED() 0
+#endif
 
 /* ===========================================================================
  * Function prototypes.
 */
-typedef enum {
-    need_more, /* block not completed, need more input or more output */
-    block_done, /* block flush performed */
-    finish_started, /* finish started, need only more output at next deflate */
-    finish_done /* finish done, accept no more input or output */
-} block_state;
 
 typedef block_state (*compress_func) (deflate_state *s, int flush);
 /* Compression function. Returns the block state after the call. */
@@ -72,7 +75,6 @@ static block_state deflate_fast (deflate_state *s, int flush);
 static block_state deflate_slow (deflate_state *s, int flush);
 static void lm_init (deflate_state *s);
 static void putShortMSB (deflate_state *s, uInt b);
-static void flush_pending (z_streamp strm);
 static int read_buf (z_streamp strm, Byte *buf, unsigned size);
 static uInt longest_match (deflate_state *s, IPos cur_match);
 
@@ -98,6 +100,25 @@ static void check_match (deflate_state *s, IPos start, IPos match,
  * See deflate.c for comments about the MIN_MATCH+1.
  */
 
+/* Workspace to be allocated for deflate processing */
+typedef struct deflate_workspace {
+    /* State memory for the deflator */
+    deflate_state deflate_memory;
+#ifdef CONFIG_ZLIB_DFLTCC
+    /* State memory for s390 hardware deflate */
+    struct dfltcc_state dfltcc_memory;
+#endif
+    Byte *window_memory;
+    Pos *prev_memory;
+    Pos *head_memory;
+    char *overlay_memory;
+} deflate_workspace;
+
+#ifdef CONFIG_ZLIB_DFLTCC
+/* dfltcc_state must be doubleword aligned for DFLTCC call */
+static_assert(offsetof(struct deflate_workspace, dfltcc_memory) % 8 == 0);
+#endif
+
 /* Values for max_lazy_match, good_match and max_chain_length, depending on
  * the desired pack level (0..9). The values given below have been tuned to
  * exclude worst case performance for pathological files. Better values may be
@@ -207,7 +228,15 @@ int zlib_deflateInit2(
      */
     next = (char *) mem;
     next += sizeof(*mem);
+#ifdef CONFIG_ZLIB_DFLTCC
+    /*
+     * DFLTCC requires the window to be page aligned.
+     * Thus, we overallocate and take the aligned portion of the buffer.
+     */
+    mem->window_memory = (Byte *) PTR_ALIGN(next, PAGE_SIZE);
+#else
     mem->window_memory = (Byte *) next;
+#endif
     next += zlib_deflate_window_memsize(windowBits);
     mem->prev_memory = (Pos *) next;
     next += zlib_deflate_prev_memsize(windowBits);
@@ -277,6 +306,8 @@ int zlib_deflateReset(
     zlib_tr_init(s);
     lm_init(s);
 
+    DEFLATE_RESET_HOOK(strm);
+
     return Z_OK;
 }
 
@@ -294,35 +325,6 @@ static void putShortMSB(
     put_byte(s, (Byte)(b & 0xff));
 }
 
-/* =========================================================================
- * Flush as much pending output as possible. All deflate() output goes
- * through this function so some applications may wish to modify it
- * to avoid allocating a large strm->next_out buffer and copying into it.
- * (See also read_buf()).
- */
-static void flush_pending(
-    z_streamp strm
-)
-{
-    deflate_state *s = (deflate_state *) strm->state;
-    unsigned len = s->pending;
-
-    if (len > strm->avail_out) len = strm->avail_out;
-    if (len == 0) return;
-
-    if (strm->next_out != NULL) {
-        memcpy(strm->next_out, s->pending_out, len);
-        strm->next_out += len;
-    }
-    s->pending_out += len;
-    strm->total_out += len;
-    strm->avail_out -= len;
-    s->pending -= len;
-    if (s->pending == 0) {
-        s->pending_out = s->pending_buf;
-    }
-}
-
 /* ========================================================================= */
 int zlib_deflate(
     z_streamp strm,
@@ -404,7 +406,8 @@ int zlib_deflate(
         (flush != Z_NO_FLUSH && s->status != FINISH_STATE)) {
         block_state bstate;
 
-        bstate = (*(configuration_table[s->level].func))(s, flush);
+        bstate = DEFLATE_HOOK(strm, flush, &bstate) ? bstate :
+                 (*(configuration_table[s->level].func))(s, flush);
 
         if (bstate == finish_started || bstate == finish_done) {
             s->status = FINISH_STATE;
@@ -503,7 +506,8 @@ static int read_buf(
 
     strm->avail_in -= len;
 
-    if (!((deflate_state *)(strm->state))->noheader) {
+    if (!DEFLATE_NEED_CHECKSUM(strm)) {}
+    else if (!((deflate_state *)(strm->state))->noheader) {
         strm->adler = zlib_adler32(strm->adler, strm->next_in, len);
     }
     memcpy(buf, strm->next_in, len);
@@ -1135,3 +1139,8 @@ int zlib_deflate_workspacesize(int windowBits, int memLevel)
         + zlib_deflate_head_memsize(memLevel)
         + zlib_deflate_overlay_memsize(memLevel);
 }
+
+int zlib_deflate_dfltcc_enabled(void)
+{
+    return DEFLATE_DFLTCC_ENABLED();
+}
diff --git a/lib/zlib_deflate/deflate_syms.c b/lib/zlib_deflate/deflate_syms.c
index 72fe4b73be53..24b740b99678 100644
--- a/lib/zlib_deflate/deflate_syms.c
+++ b/lib/zlib_deflate/deflate_syms.c
@@ -12,6 +12,7 @@
 #include <linux/zlib.h>
 
 EXPORT_SYMBOL(zlib_deflate_workspacesize);
+EXPORT_SYMBOL(zlib_deflate_dfltcc_enabled);
 EXPORT_SYMBOL(zlib_deflate);
 EXPORT_SYMBOL(zlib_deflateInit2);
 EXPORT_SYMBOL(zlib_deflateEnd);
diff --git a/lib/zlib_deflate/deftree.c b/lib/zlib_deflate/deftree.c
index 9b1756b12743..a4a34da512fe 100644
--- a/lib/zlib_deflate/deftree.c
+++ b/lib/zlib_deflate/deftree.c
@@ -76,11 +76,6 @@ static const uch bl_order[BL_CODES]
  * probability, to avoid transmitting the lengths for unused bit length codes.
  */
 
-#define Buf_size (8 * 2*sizeof(char))
-/* Number of bits used within bi_buf. (bi_buf might be implemented on
- * more than 16 bits on some systems.)
- */
-
 /* ===========================================================================
  * Local data. These are initialized only once.
 */
@@ -147,7 +142,6 @@ static void send_all_trees (deflate_state *s, int lcodes, int dcodes,
 static void compress_block (deflate_state *s, ct_data *ltree,
                             ct_data *dtree);
 static void set_data_type (deflate_state *s);
-static void bi_windup (deflate_state *s);
 static void bi_flush (deflate_state *s);
 static void copy_block (deflate_state *s, char *buf, unsigned len,
                         int header);
@@ -170,54 +164,6 @@ static void copy_block (deflate_state *s, char *buf, unsigned len,
  */
 
 /* ===========================================================================
- * Send a value on a given number of bits.
- * IN assertion: length <= 16 and value fits in length bits.
- */
-#ifdef DEBUG_ZLIB
-static void send_bits (deflate_state *s, int value, int length);
-
-static void send_bits(
-    deflate_state *s,
-    int value, /* value to send */
-    int length /* number of bits */
-)
-{
-    Tracevv((stderr," l %2d v %4x ", length, value));
-    Assert(length > 0 && length <= 15, "invalid length");
-    s->bits_sent += (ulg)length;
-
-    /* If not enough room in bi_buf, use (valid) bits from bi_buf and
-     * (16 - bi_valid) bits from value, leaving (width - (16-bi_valid))
-     * unused bits in value.
-     */
-    if (s->bi_valid > (int)Buf_size - length) {
-        s->bi_buf |= (value << s->bi_valid);
-        put_short(s, s->bi_buf);
-        s->bi_buf = (ush)value >> (Buf_size - s->bi_valid);
-        s->bi_valid += length - Buf_size;
-    } else {
-        s->bi_buf |= value << s->bi_valid;
-        s->bi_valid += length;
-    }
-}
-#else /* !DEBUG_ZLIB */
-
-#define send_bits(s, value, length) \
-{ int len = length;\
-  if (s->bi_valid > (int)Buf_size - len) {\
-    int val = value;\
-    s->bi_buf |= (val << s->bi_valid);\
-    put_short(s, s->bi_buf);\
-    s->bi_buf = (ush)val >> (Buf_size - s->bi_valid);\
-    s->bi_valid += len - Buf_size;\
-  } else {\
-    s->bi_buf |= (value) << s->bi_valid;\
-    s->bi_valid += len;\
-  }\
-}
-#endif /* DEBUG_ZLIB */
-
-/* ===========================================================================
  * Initialize the various 'constant' tables. In a multi-threaded environment,
  * this function may be called by two threads concurrently, but this is
  * harmless since both invocations do exactly the same thing.
diff --git a/lib/zlib_deflate/defutil.h b/lib/zlib_deflate/defutil.h
index a8c370897c9f..385333b22ec6 100644
--- a/lib/zlib_deflate/defutil.h
+++ b/lib/zlib_deflate/defutil.h
@@ -1,5 +1,7 @@
+#ifndef DEFUTIL_H
+#define DEFUTIL_H
 
-
+#include <linux/zutil.h>
 
 #define Assert(err, str)
 #define Trace(dummy)
@@ -238,17 +240,13 @@ typedef struct deflate_state {
 
 } deflate_state;
 
-typedef struct deflate_workspace {
-    /* State memory for the deflator */
-    deflate_state deflate_memory;
-    Byte *window_memory;
-    Pos *prev_memory;
-    Pos *head_memory;
-    char *overlay_memory;
-} deflate_workspace;
-
+#ifdef CONFIG_ZLIB_DFLTCC
+#define zlib_deflate_window_memsize(windowBits) \
+	(2 * (1 << (windowBits)) * sizeof(Byte) + PAGE_SIZE)
+#else
 #define zlib_deflate_window_memsize(windowBits) \
 	(2 * (1 << (windowBits)) * sizeof(Byte))
+#endif
 #define zlib_deflate_prev_memsize(windowBits) \
 	((1 << (windowBits)) * sizeof(Pos))
 #define zlib_deflate_head_memsize(memLevel) \
@@ -293,6 +291,24 @@ void zlib_tr_stored_type_only (deflate_state *);
 }
 
 /* ===========================================================================
+ * Reverse the first len bits of a code, using straightforward code (a faster
+ * method would use a table)
+ * IN assertion: 1 <= len <= 15
+ */
+static inline unsigned bi_reverse(
+    unsigned code, /* the value to invert */
+    int len /* its bit length */
+)
+{
+    register unsigned res = 0;
+    do {
+        res |= code & 1;
+        code >>= 1, res <<= 1;
+    } while (--len > 0);
+    return res >> 1;
+}
+
+/* ===========================================================================
  * Flush the bit buffer, keeping at most 7 bits in it.
 */
 static inline void bi_flush(deflate_state *s)
@@ -325,3 +341,101 @@ static inline void bi_windup(deflate_state *s)
 #endif
 }
 
+typedef enum {
+    need_more, /* block not completed, need more input or more output */
+    block_done, /* block flush performed */
+    finish_started, /* finish started, need only more output at next deflate */
+    finish_done /* finish done, accept no more input or output */
+} block_state;
+
+#define Buf_size (8 * 2*sizeof(char))
+/* Number of bits used within bi_buf. (bi_buf might be implemented on
+ * more than 16 bits on some systems.)
+ */
+
+/* ===========================================================================
+ * Send a value on a given number of bits.
+ * IN assertion: length <= 16 and value fits in length bits.
+ */
+#ifdef DEBUG_ZLIB
+static void send_bits (deflate_state *s, int value, int length);
+
+static void send_bits(
+    deflate_state *s,
+    int value, /* value to send */
+    int length /* number of bits */
+)
+{
+    Tracevv((stderr," l %2d v %4x ", length, value));
+    Assert(length > 0 && length <= 15, "invalid length");
+    s->bits_sent += (ulg)length;
+
+    /* If not enough room in bi_buf, use (valid) bits from bi_buf and
+     * (16 - bi_valid) bits from value, leaving (width - (16-bi_valid))
+     * unused bits in value.
+     */
+    if (s->bi_valid > (int)Buf_size - length) {
+        s->bi_buf |= (value << s->bi_valid);
+        put_short(s, s->bi_buf);
+        s->bi_buf = (ush)value >> (Buf_size - s->bi_valid);
+        s->bi_valid += length - Buf_size;
+    } else {
+        s->bi_buf |= value << s->bi_valid;
+        s->bi_valid += length;
+    }
+}
+#else /* !DEBUG_ZLIB */
+
+#define send_bits(s, value, length) \
+{ int len = length;\
+  if (s->bi_valid > (int)Buf_size - len) {\
+    int val = value;\
+    s->bi_buf |= (val << s->bi_valid);\
+    put_short(s, s->bi_buf);\
+    s->bi_buf = (ush)val >> (Buf_size - s->bi_valid);\
+    s->bi_valid += len - Buf_size;\
+  } else {\
+    s->bi_buf |= (value) << s->bi_valid;\
+    s->bi_valid += len;\
+  }\
+}
+#endif /* DEBUG_ZLIB */
+
+static inline void zlib_tr_send_bits(
+    deflate_state *s,
+    int value,
+    int length
+)
+{
+    send_bits(s, value, length);
+}
+
+/* =========================================================================
+ * Flush as much pending output as possible. All deflate() output goes
+ * through this function so some applications may wish to modify it
+ * to avoid allocating a large strm->next_out buffer and copying into it.
+ * (See also read_buf()).
+ */
+static inline void flush_pending(
+    z_streamp strm
+)
+{
+    deflate_state *s = (deflate_state *) strm->state;
+    unsigned len = s->pending;
+
+    if (len > strm->avail_out) len = strm->avail_out;
+    if (len == 0) return;
+
+    if (strm->next_out != NULL) {
+        memcpy(strm->next_out, s->pending_out, len);
+        strm->next_out += len;
+    }
+    s->pending_out += len;
+    strm->total_out += len;
+    strm->avail_out -= len;
+    s->pending -= len;
+    if (s->pending == 0) {
+        s->pending_out = s->pending_buf;
+    }
+}
+#endif /* DEFUTIL_H */
diff --git a/lib/zlib_dfltcc/Makefile b/lib/zlib_dfltcc/Makefile
new file mode 100644
index 000000000000..8e4d5afbbb10
--- /dev/null
+++ b/lib/zlib_dfltcc/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# This is a modified version of zlib, which does all memory
+# allocation ahead of time.
+#
+# This is the code for s390 zlib hardware support.
+#
+
+obj-$(CONFIG_ZLIB_DFLTCC) += zlib_dfltcc.o
+
+zlib_dfltcc-objs := dfltcc.o dfltcc_deflate.o dfltcc_inflate.o dfltcc_syms.o
diff --git a/lib/zlib_dfltcc/dfltcc.c b/lib/zlib_dfltcc/dfltcc.c
new file mode 100644
index 000000000000..c30de430b30c
--- /dev/null
+++ b/lib/zlib_dfltcc/dfltcc.c
@@ -0,0 +1,55 @@
+// SPDX-License-Identifier: Zlib
+/* dfltcc.c - SystemZ DEFLATE CONVERSION CALL support. */
+
+#include <linux/zutil.h>
+#include "dfltcc_util.h"
+#include "dfltcc.h"
+
+char *oesc_msg(
+    char *buf,
+    int oesc
+)
+{
+    if (oesc == 0x00)
+        return NULL; /* Successful completion */
+    else {
+#ifdef STATIC
+        return NULL; /* Ignore for pre-boot decompressor */
+#else
+        sprintf(buf, "Operation-Ending-Supplemental Code is 0x%.2X", oesc);
+        return buf;
+#endif
+    }
+}
+
+void dfltcc_reset(
+    z_streamp strm,
+    uInt size
+)
+{
+    struct dfltcc_state *dfltcc_state =
+        (struct dfltcc_state *)((char *)strm->state + size);
+    struct dfltcc_qaf_param *param =
+        (struct dfltcc_qaf_param *)&dfltcc_state->param;
+
+    /* Initialize available functions */
+    if (is_dfltcc_enabled()) {
+        dfltcc(DFLTCC_QAF, param, NULL, NULL, NULL, NULL, NULL);
+        memmove(&dfltcc_state->af, param, sizeof(dfltcc_state->af));
+    } else
+        memset(&dfltcc_state->af, 0, sizeof(dfltcc_state->af));
+
+    /* Initialize parameter block */
+    memset(&dfltcc_state->param, 0, sizeof(dfltcc_state->param));
+    dfltcc_state->param.nt = 1;
+
+    /* Initialize tuning parameters */
+    if (zlib_dfltcc_support == ZLIB_DFLTCC_FULL_DEBUG)
+        dfltcc_state->level_mask = DFLTCC_LEVEL_MASK_DEBUG;
+    else
+        dfltcc_state->level_mask = DFLTCC_LEVEL_MASK;
+    dfltcc_state->block_size = DFLTCC_BLOCK_SIZE;
+    dfltcc_state->block_threshold = DFLTCC_FIRST_FHT_BLOCK_SIZE;
+    dfltcc_state->dht_threshold = DFLTCC_DHT_MIN_SAMPLE_SIZE;
+    dfltcc_state->param.ribm = DFLTCC_RIBM;
+}
diff --git a/lib/zlib_dfltcc/dfltcc.h b/lib/zlib_dfltcc/dfltcc.h
new file mode 100644
index 000000000000..2a2fac1d050a
--- /dev/null
+++ b/lib/zlib_dfltcc/dfltcc.h
@@ -0,0 +1,155 @@
+// SPDX-License-Identifier: Zlib
+#ifndef DFLTCC_H
+#define DFLTCC_H
+
+#include "../zlib_deflate/defutil.h"
+#include <asm/facility.h>
+#include <asm/setup.h>
+
+/*
+ * Tuning parameters.
+ */
+#define DFLTCC_LEVEL_MASK 0x2 /* DFLTCC compression for level 1 only */
+#define DFLTCC_LEVEL_MASK_DEBUG 0x3fe /* DFLTCC compression for all levels */
+#define DFLTCC_BLOCK_SIZE 1048576
+#define DFLTCC_FIRST_FHT_BLOCK_SIZE 4096
+#define DFLTCC_DHT_MIN_SAMPLE_SIZE 4096
+#define DFLTCC_RIBM 0
+
+#define DFLTCC_FACILITY 151
+
+/*
+ * Parameter Block for Query Available Functions.
+ */
+struct dfltcc_qaf_param {
+    char fns[16];
+    char reserved1[8];
+    char fmts[2];
+    char reserved2[6];
+};
+
+static_assert(sizeof(struct dfltcc_qaf_param) == 32);
+
+#define DFLTCC_FMT0 0
+
+/*
+ * Parameter Block for Generate Dynamic-Huffman Table, Compress and Expand.
+ */
+struct dfltcc_param_v0 {
+    uint16_t pbvn; /* Parameter-Block-Version Number */
+    uint8_t mvn; /* Model-Version Number */
+    uint8_t ribm; /* Reserved for IBM use */
+    unsigned reserved32 : 31;
+    unsigned cf : 1; /* Continuation Flag */
+    uint8_t reserved64[8];
+    unsigned nt : 1; /* New Task */
+    unsigned reserved129 : 1;
+    unsigned cvt : 1; /* Check Value Type */
+    unsigned reserved131 : 1;
+    unsigned htt : 1; /* Huffman-Table Type */
+    unsigned bcf : 1; /* Block-Continuation Flag */
+    unsigned bcc : 1; /* Block Closing Control */
+    unsigned bhf : 1; /* Block Header Final */
+    unsigned reserved136 : 1;
+    unsigned reserved137 : 1;
+    unsigned dhtgc : 1; /* DHT Generation Control */
+    unsigned reserved139 : 5;
+    unsigned reserved144 : 5;
+    unsigned sbb : 3; /* Sub-Byte Boundary */
+    uint8_t oesc; /* Operation-Ending-Supplemental Code */
+    unsigned reserved160 : 12;
+    unsigned ifs : 4; /* Incomplete-Function Status */
+    uint16_t ifl; /* Incomplete-Function Length */
+    uint8_t reserved192[8];
+    uint8_t reserved256[8];
+    uint8_t reserved320[4];
+    uint16_t hl; /* History Length */
+    unsigned reserved368 : 1;
+    uint16_t ho : 15; /* History Offset */
+    uint32_t cv; /* Check Value */
+    unsigned eobs : 15; /* End-of-block Symbol */
+    unsigned reserved431: 1;
+    uint8_t eobl : 4; /* End-of-block Length */
+    unsigned reserved436 : 12;
+    unsigned reserved448 : 4;
+    uint16_t cdhtl : 12; /* Compressed-Dynamic-Huffman Table
+                            Length */
+    uint8_t reserved464[6];
+    uint8_t cdht[288];
+    uint8_t reserved[32];
+    uint8_t csb[1152];
+};
+
+static_assert(sizeof(struct dfltcc_param_v0) == 1536);
+
+#define CVT_CRC32 0
+#define CVT_ADLER32 1
+#define HTT_FIXED 0
+#define HTT_DYNAMIC 1
+
+/*
+ * Extension of inflate_state and deflate_state for DFLTCC.
+ */
+struct dfltcc_state {
+    struct dfltcc_param_v0 param; /* Parameter block */
+    struct dfltcc_qaf_param af; /* Available functions */
+    uLong level_mask; /* Levels on which to use DFLTCC */
+    uLong block_size; /* New block each X bytes */
+    uLong block_threshold; /* New block after total_in > X */
+    uLong dht_threshold; /* New block only if avail_in >= X */
+    char msg[64]; /* Buffer for strm->msg */
+};
+
+/* Resides right after inflate_state or deflate_state */
+#define GET_DFLTCC_STATE(state) ((struct dfltcc_state *)((state) + 1))
+
+/* External functions */
+int dfltcc_can_deflate(z_streamp strm);
+int dfltcc_deflate(z_streamp strm,
+                   int flush,
+                   block_state *result);
+void dfltcc_reset(z_streamp strm, uInt size);
+int dfltcc_can_inflate(z_streamp strm);
+typedef enum {
+    DFLTCC_INFLATE_CONTINUE,
+    DFLTCC_INFLATE_BREAK,
+    DFLTCC_INFLATE_SOFTWARE,
+} dfltcc_inflate_action;
+dfltcc_inflate_action dfltcc_inflate(z_streamp strm,
+                                     int flush, int *ret);
+static inline int is_dfltcc_enabled(void)
+{
+return (zlib_dfltcc_support != ZLIB_DFLTCC_DISABLED &&
+        test_facility(DFLTCC_FACILITY));
+}
+
+#define DEFLATE_RESET_HOOK(strm) \
+    dfltcc_reset((strm), sizeof(deflate_state))
+
+#define DEFLATE_HOOK dfltcc_deflate
+
+#define DEFLATE_NEED_CHECKSUM(strm) (!dfltcc_can_deflate((strm)))
+
+#define DEFLATE_DFLTCC_ENABLED() is_dfltcc_enabled()
+
+#define INFLATE_RESET_HOOK(strm) \
+    dfltcc_reset((strm), sizeof(struct inflate_state))
+
+#define INFLATE_TYPEDO_HOOK(strm, flush) \
+    if (dfltcc_can_inflate((strm))) { \
+        dfltcc_inflate_action action; \
+\
+        RESTORE(); \
+        action = dfltcc_inflate((strm), (flush), &ret); \
+        LOAD(); \
+        if (action == DFLTCC_INFLATE_CONTINUE) \
+            break; \
+        else if (action == DFLTCC_INFLATE_BREAK) \
+            goto inf_leave; \
+    }
+
+#define INFLATE_NEED_CHECKSUM(strm) (!dfltcc_can_inflate((strm)))
+
+#define INFLATE_NEED_UPDATEWINDOW(strm) (!dfltcc_can_inflate((strm)))
+
+#endif /* DFLTCC_H */
diff --git a/lib/zlib_dfltcc/dfltcc_deflate.c b/lib/zlib_dfltcc/dfltcc_deflate.c
new file mode 100644
index 000000000000..00c185101c6d
--- /dev/null
+++ b/lib/zlib_dfltcc/dfltcc_deflate.c
@@ -0,0 +1,279 @@
+// SPDX-License-Identifier: Zlib
+
+#include "../zlib_deflate/defutil.h"
+#include "dfltcc_util.h"
+#include "dfltcc.h"
+#include <asm/setup.h>
+#include <linux/zutil.h>
+
+/*
+ * Compress.
+ */
+int dfltcc_can_deflate(
+    z_streamp strm
+)
+{
+    deflate_state *state = (deflate_state *)strm->state;
+    struct dfltcc_state *dfltcc_state = GET_DFLTCC_STATE(state);
+
+    /* Check for kernel dfltcc command line parameter */
+    if (zlib_dfltcc_support == ZLIB_DFLTCC_DISABLED ||
+            zlib_dfltcc_support == ZLIB_DFLTCC_INFLATE_ONLY)
+        return 0;
+
+    /* Unsupported compression settings */
+    if (!dfltcc_are_params_ok(state->level, state->w_bits, state->strategy,
+                              dfltcc_state->level_mask))
+        return 0;
+
+    /* Unsupported hardware */
+    if (!is_bit_set(dfltcc_state->af.fns, DFLTCC_GDHT) ||
+            !is_bit_set(dfltcc_state->af.fns, DFLTCC_CMPR) ||
+            !is_bit_set(dfltcc_state->af.fmts, DFLTCC_FMT0))
+        return 0;
+
+    return 1;
+}
+
+static void dfltcc_gdht(
+    z_streamp strm
+)
+{
+    deflate_state *state = (deflate_state *)strm->state;
+    struct dfltcc_param_v0 *param = &GET_DFLTCC_STATE(state)->param;
+    size_t avail_in = avail_in = strm->avail_in;
+
+    dfltcc(DFLTCC_GDHT,
+           param, NULL, NULL,
+           &strm->next_in, &avail_in, NULL);
+}
+
+static dfltcc_cc dfltcc_cmpr(
+    z_streamp strm
+)
+{
+    deflate_state *state = (deflate_state *)strm->state;
+    struct dfltcc_param_v0 *param = &GET_DFLTCC_STATE(state)->param;
+    size_t avail_in = strm->avail_in;
+    size_t avail_out = strm->avail_out;
+    dfltcc_cc cc;
+
+    cc = dfltcc(DFLTCC_CMPR | HBT_CIRCULAR,
+                param, &strm->next_out, &avail_out,
+                &strm->next_in, &avail_in, state->window);
+    strm->total_in += (strm->avail_in - avail_in);
+    strm->total_out += (strm->avail_out - avail_out);
+    strm->avail_in = avail_in;
+    strm->avail_out = avail_out;
+    return cc;
+}
+
+static void send_eobs(
+    z_streamp strm,
+    const struct dfltcc_param_v0 *param
+)
+{
+    deflate_state *state = (deflate_state *)strm->state;
+
+    zlib_tr_send_bits(
+          state,
+          bi_reverse(param->eobs >> (15 - param->eobl), param->eobl),
+          param->eobl);
+    flush_pending(strm);
+    if (state->pending != 0) {
+        /* The remaining data is located in pending_out[0:pending]. If someone
+         * calls put_byte() - this might happen in deflate() - the byte will be
+         * placed into pending_buf[pending], which is incorrect. Move the
+         * remaining data to the beginning of pending_buf so that put_byte() is
+         * usable again.
+         */
+        memmove(state->pending_buf, state->pending_out, state->pending);
+        state->pending_out = state->pending_buf;
+    }
+#ifdef ZLIB_DEBUG
+    state->compressed_len += param->eobl;
+#endif
+}
+
+int dfltcc_deflate(
+    z_streamp strm,
+    int flush,
+    block_state *result
+)
+{
+    deflate_state *state = (deflate_state *)strm->state;
+    struct dfltcc_state *dfltcc_state = GET_DFLTCC_STATE(state);
+    struct dfltcc_param_v0 *param = &dfltcc_state->param;
+    uInt masked_avail_in;
+    dfltcc_cc cc;
+    int need_empty_block;
+    int soft_bcc;
+    int no_flush;
+
+    if (!dfltcc_can_deflate(strm))
+        return 0;
+
+again:
+    masked_avail_in = 0;
+    soft_bcc = 0;
+    no_flush = flush == Z_NO_FLUSH;
+
+    /* Trailing empty block. Switch to software, except when Continuation Flag
+     * is set, which means that DFLTCC has buffered some output in the
+     * parameter block and needs to be called again in order to flush it.
+     */
+    if (flush == Z_FINISH && strm->avail_in == 0 && !param->cf) {
+        if (param->bcf) {
+            /* A block is still open, and the hardware does not support closing
+             * blocks without adding data. Thus, close it manually.
+             */
+            send_eobs(strm, param);
+            param->bcf = 0;
+        }
+        return 0;
+    }
+
+    if (strm->avail_in == 0 && !param->cf) {
+        *result = need_more;
+        return 1;
+    }
+
+    /* There is an open non-BFINAL block, we are not going to close it just
+     * yet, we have compressed more than DFLTCC_BLOCK_SIZE bytes and we see
+     * more than DFLTCC_DHT_MIN_SAMPLE_SIZE bytes. Open a new block with a new
+     * DHT in order to adapt to a possibly changed input data distribution.
+     */
+    if (param->bcf && no_flush &&
+            strm->total_in > dfltcc_state->block_threshold &&
+            strm->avail_in >= dfltcc_state->dht_threshold) {
+        if (param->cf) {
+            /* We need to flush the DFLTCC buffer before writing the
+             * End-of-block Symbol. Mask the input data and proceed as usual.
+             */
+            masked_avail_in += strm->avail_in;
+            strm->avail_in = 0;
+            no_flush = 0;
+        } else {
+            /* DFLTCC buffer is empty, so we can manually write the
+             * End-of-block Symbol right away.
+             */
+            send_eobs(strm, param);
+            param->bcf = 0;
+            dfltcc_state->block_threshold =
+                strm->total_in + dfltcc_state->block_size;
+            if (strm->avail_out == 0) {
+                *result = need_more;
+                return 1;
+            }
+        }
+    }
+
+    /* The caller gave us too much data. Pass only one block worth of
+     * uncompressed data to DFLTCC and mask the rest, so that on the next
+     * iteration we start a new block.
+     */
+    if (no_flush && strm->avail_in > dfltcc_state->block_size) {
+        masked_avail_in += (strm->avail_in - dfltcc_state->block_size);
+        strm->avail_in = dfltcc_state->block_size;
+    }
+
+    /* When we have an open non-BFINAL deflate block and caller indicates that
+     * the stream is ending, we need to close an open deflate block and open a
+     * BFINAL one.
+     */
+    need_empty_block = flush == Z_FINISH && param->bcf && !param->bhf;
+
+    /* Translate stream to parameter block */
+    param->cvt = CVT_ADLER32;
+    if (!no_flush)
+        /* We need to close a block. Always do this in software - when there is
+         * no input data, the hardware will not honor BCC. */
+        soft_bcc = 1;
+    if (flush == Z_FINISH && !param->bcf)
+        /* We are about to open a BFINAL block, set Block Header Final bit
+         * until the stream ends.
+         */
+        param->bhf = 1;
+    /* DFLTCC-CMPR will write to next_out, so make sure that buffers with
+     * higher precedence are empty.
+     */
+    Assert(state->pending == 0, "There must be no pending bytes");
+    Assert(state->bi_valid < 8, "There must be less than 8 pending bits");
+    param->sbb = (unsigned int)state->bi_valid;
+    if (param->sbb > 0)
+        *strm->next_out = (Byte)state->bi_buf;
+    if (param->hl)
+        param->nt = 0; /* Honor history */
+    param->cv = strm->adler;
+
+    /* When opening a block, choose a Huffman-Table Type */
+    if (!param->bcf) {
+        if (strm->total_in == 0 && dfltcc_state->block_threshold > 0) {
+            param->htt = HTT_FIXED;
+        }
+        else {
+            param->htt = HTT_DYNAMIC;
+            dfltcc_gdht(strm);
+        }
+    }
+
+    /* Deflate */
+    do {
+        cc = dfltcc_cmpr(strm);
+        if (strm->avail_in < 4096 && masked_avail_in > 0)
+            /* We are about to call DFLTCC with a small input buffer, which is
+             * inefficient. Since there is masked data, there will be at least
+             * one more DFLTCC call, so skip the current one and make the next
+             * one handle more data.
+             */
+            break;
+    } while (cc == DFLTCC_CC_AGAIN);
+
+    /* Translate parameter block to stream */
+    strm->msg = oesc_msg(dfltcc_state->msg, param->oesc);
+    state->bi_valid = param->sbb;
+    if (state->bi_valid == 0)
+        state->bi_buf = 0; /* Avoid accessing next_out */
+    else
+        state->bi_buf = *strm->next_out & ((1 << state->bi_valid) - 1);
+    strm->adler = param->cv;
+
+    /* Unmask the input data */
+    strm->avail_in += masked_avail_in;
+    masked_avail_in = 0;
+
+    /* If we encounter an error, it means there is a bug in DFLTCC call */
+    Assert(cc != DFLTCC_CC_OP2_CORRUPT || param->oesc == 0, "BUG");
+
+    /* Update Block-Continuation Flag. It will be used to check whether to call
+     * GDHT the next time.
+     */
+    if (cc == DFLTCC_CC_OK) {
+        if (soft_bcc) {
+            send_eobs(strm, param);
+            param->bcf = 0;
+            dfltcc_state->block_threshold =
+                strm->total_in + dfltcc_state->block_size;
+        } else
+            param->bcf = 1;
+        if (flush == Z_FINISH) {
+            if (need_empty_block)
+                /* Make the current deflate() call also close the stream */
+                return 0;
+            else {
+                bi_windup(state);
+                *result = finish_done;
+            }
+        } else {
+            if (flush == Z_FULL_FLUSH)
+                param->hl = 0; /* Clear history */
+            *result = flush == Z_NO_FLUSH ? need_more : block_done;
+        }
+    } else {
+        param->bcf = 1;
+        *result = need_more;
+    }
+    if (strm->avail_in != 0 && strm->avail_out != 0)
+        goto again; /* deflate() must use all input or all output */
+    return 1;
+}
diff --git a/lib/zlib_dfltcc/dfltcc_inflate.c b/lib/zlib_dfltcc/dfltcc_inflate.c
new file mode 100644
index 000000000000..aa9ef23474df
--- /dev/null
+++ b/lib/zlib_dfltcc/dfltcc_inflate.c
@@ -0,0 +1,149 @@
+// SPDX-License-Identifier: Zlib
+
+#include "../zlib_inflate/inflate.h"
+#include "dfltcc_util.h"
+#include "dfltcc.h"
+#include <asm/setup.h>
+#include <linux/zutil.h>
+
+/*
+ * Expand.
+ */
+int dfltcc_can_inflate(
+    z_streamp strm
+)
+{
+    struct inflate_state *state = (struct inflate_state *)strm->state;
+    struct dfltcc_state *dfltcc_state = GET_DFLTCC_STATE(state);
+
+    /* Check for kernel dfltcc command line parameter */
+    if (zlib_dfltcc_support == ZLIB_DFLTCC_DISABLED ||
+            zlib_dfltcc_support == ZLIB_DFLTCC_DEFLATE_ONLY)
+        return 0;
+
+    /* Unsupported compression settings */
+    if (state->wbits != HB_BITS)
+        return 0;
+
+    /* Unsupported hardware */
+    return is_bit_set(dfltcc_state->af.fns, DFLTCC_XPND) &&
+               is_bit_set(dfltcc_state->af.fmts, DFLTCC_FMT0);
+}
+
+static int dfltcc_was_inflate_used(
+    z_streamp strm
+)
+{
+    struct inflate_state *state = (struct inflate_state *)strm->state;
+    struct dfltcc_param_v0 *param = &GET_DFLTCC_STATE(state)->param;
+
+    return !param->nt;
+}
+
+static int dfltcc_inflate_disable(
+    z_streamp strm
+)
+{
+    struct inflate_state *state = (struct inflate_state *)strm->state;
+    struct dfltcc_state *dfltcc_state = GET_DFLTCC_STATE(state);
+
+    if (!dfltcc_can_inflate(strm))
+        return 0;
+    if (dfltcc_was_inflate_used(strm))
+        /* DFLTCC has already decompressed some data. Since there is not
+         * enough information to resume decompression in software, the call
+         * must fail.
+         */
+        return 1;
+    /* DFLTCC was not used yet - decompress in software */
+    memset(&dfltcc_state->af, 0, sizeof(dfltcc_state->af));
+    return 0;
+}
+
+static dfltcc_cc dfltcc_xpnd(
+    z_streamp strm
+)
+{
+    struct inflate_state *state = (struct inflate_state *)strm->state;
+    struct dfltcc_param_v0 *param = &GET_DFLTCC_STATE(state)->param;
+    size_t avail_in = strm->avail_in;
+    size_t avail_out = strm->avail_out;
+    dfltcc_cc cc;
+
+    cc = dfltcc(DFLTCC_XPND | HBT_CIRCULAR,
+                param, &strm->next_out, &avail_out,
+                &strm->next_in, &avail_in, state->window);
+    strm->avail_in = avail_in;
+    strm->avail_out = avail_out;
+    return cc;
+}
+
+dfltcc_inflate_action dfltcc_inflate(
+    z_streamp strm,
+    int flush,
+    int *ret
+)
+{
+    struct inflate_state *state = (struct inflate_state *)strm->state;
+    struct dfltcc_state *dfltcc_state = GET_DFLTCC_STATE(state);
+    struct dfltcc_param_v0 *param = &dfltcc_state->param;
+    dfltcc_cc cc;
+
+    if (flush == Z_BLOCK) {
+        /* DFLTCC does not support stopping on block boundaries */
+        if (dfltcc_inflate_disable(strm)) {
+            *ret = Z_STREAM_ERROR;
+            return DFLTCC_INFLATE_BREAK;
+        } else
+            return DFLTCC_INFLATE_SOFTWARE;
+    }
+
+    if (state->last) {
+        if (state->bits != 0) {
+            strm->next_in++;
+            strm->avail_in--;
+            state->bits = 0;
+        }
+        state->mode = CHECK;
+        return DFLTCC_INFLATE_CONTINUE;
+    }
+
+    if (strm->avail_in == 0 && !param->cf)
+        return DFLTCC_INFLATE_BREAK;
+
+    if (!state->window || state->wsize == 0) {
+        state->mode = MEM;
+        return DFLTCC_INFLATE_CONTINUE;
+    }
+
+    /* Translate stream to parameter block */
+    param->cvt = CVT_ADLER32;
+    param->sbb = state->bits;
+    param->hl = state->whave; /* Software and hardware history formats match */
+    param->ho = (state->write - state->whave) & ((1 << HB_BITS) - 1);
+    if (param->hl)
+        param->nt = 0; /* Honor history for the first block */
+    param->cv = state->flags ? REVERSE(state->check) : state->check;
+
+    /* Inflate */
+    do {
+        cc = dfltcc_xpnd(strm);
+    } while (cc == DFLTCC_CC_AGAIN);
+
+    /* Translate parameter block to stream */
+    strm->msg = oesc_msg(dfltcc_state->msg, param->oesc);
+    state->last = cc == DFLTCC_CC_OK;
+    state->bits = param->sbb;
+    state->whave = param->hl;
+    state->write = (param->ho + param->hl) & ((1 << HB_BITS) - 1);
+    state->check = state->flags ? REVERSE(param->cv) : param->cv;
+    if (cc == DFLTCC_CC_OP2_CORRUPT && param->oesc != 0) {
+        /* Report an error if stream is corrupted */
+        state->mode = BAD;
+        return DFLTCC_INFLATE_CONTINUE;
+    }
+    state->mode = TYPEDO;
+    /* Break if operands are exhausted, otherwise continue looping */
+    return (cc == DFLTCC_CC_OP1_TOO_SHORT || cc == DFLTCC_CC_OP2_TOO_SHORT) ?
+        DFLTCC_INFLATE_BREAK : DFLTCC_INFLATE_CONTINUE;
+}
diff --git a/lib/zlib_dfltcc/dfltcc_syms.c b/lib/zlib_dfltcc/dfltcc_syms.c
new file mode 100644
index 000000000000..6f23481804c1
--- /dev/null
+++ b/lib/zlib_dfltcc/dfltcc_syms.c
@@ -0,0 +1,17 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * linux/lib/zlib_dfltcc/dfltcc_syms.c
+ *
+ * Exported symbols for the s390 zlib dfltcc support.
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/zlib.h>
+#include "dfltcc.h"
+
+EXPORT_SYMBOL(dfltcc_can_deflate);
+EXPORT_SYMBOL(dfltcc_deflate);
+EXPORT_SYMBOL(dfltcc_reset);
+MODULE_LICENSE("GPL");
diff --git a/lib/zlib_dfltcc/dfltcc_util.h b/lib/zlib_dfltcc/dfltcc_util.h
new file mode 100644
index 000000000000..4a46b5009f0d
--- /dev/null
+++ b/lib/zlib_dfltcc/dfltcc_util.h
@@ -0,0 +1,103 @@
+// SPDX-License-Identifier: Zlib
+#ifndef DFLTCC_UTIL_H
+#define DFLTCC_UTIL_H
+
+#include <linux/zutil.h>
+
+/*
+ * C wrapper for the DEFLATE CONVERSION CALL instruction.
+ */ +typedef enum { + DFLTCC_CC_OK = 0, + DFLTCC_CC_OP1_TOO_SHORT = 1, + DFLTCC_CC_OP2_TOO_SHORT = 2, + DFLTCC_CC_OP2_CORRUPT = 2, + DFLTCC_CC_AGAIN = 3, +} dfltcc_cc; + +#define DFLTCC_QAF 0 +#define DFLTCC_GDHT 1 +#define DFLTCC_CMPR 2 +#define DFLTCC_XPND 4 +#define HBT_CIRCULAR (1 << 7) +#define HB_BITS 15 +#define HB_SIZE (1 << HB_BITS) + +static inline dfltcc_cc dfltcc( + int fn, + void *param, + Byte **op1, + size_t *len1, + const Byte **op2, + size_t *len2, + void *hist +) +{ + Byte *t2 = op1 ? *op1 : NULL; + size_t t3 = len1 ? *len1 : 0; + const Byte *t4 = op2 ? *op2 : NULL; + size_t t5 = len2 ? *len2 : 0; + register int r0 __asm__("r0") = fn; + register void *r1 __asm__("r1") = param; + register Byte *r2 __asm__("r2") = t2; + register size_t r3 __asm__("r3") = t3; + register const Byte *r4 __asm__("r4") = t4; + register size_t r5 __asm__("r5") = t5; + int cc; + + __asm__ volatile( + ".insn rrf,0xb9390000,%[r2],%[r4],%[hist],0\n" + "ipm %[cc]\n" + : [r2] "+r" (r2) + , [r3] "+r" (r3) + , [r4] "+r" (r4) + , [r5] "+r" (r5) + , [cc] "=r" (cc) + : [r0] "r" (r0) + , [r1] "r" (r1) + , [hist] "r" (hist) + : "cc", "memory"); + t2 = r2; t3 = r3; t4 = r4; t5 = r5; + + if (op1) + *op1 = t2; + if (len1) + *len1 = t3; + if (op2) + *op2 = t4; + if (len2) + *len2 = t5; + return (cc >> 28) & 3; +} + +static inline int is_bit_set( + const char *bits, + int n +) +{ + return bits[n / 8] & (1 << (7 - (n % 8))); +} + +static inline void turn_bit_off( + char *bits, + int n +) +{ + bits[n / 8] &= ~(1 << (7 - (n % 8))); +} + +static inline int dfltcc_are_params_ok( + int level, + uInt window_bits, + int strategy, + uLong level_mask +) +{ + return (level_mask & (1 << level)) != 0 && + (window_bits == HB_BITS) && + (strategy == Z_DEFAULT_STRATEGY); +} + +char *oesc_msg(char *buf, int oesc); + +#endif /* DFLTCC_UTIL_H */ diff --git a/lib/zlib_inflate/inflate.c b/lib/zlib_inflate/inflate.c index 48f14cd58c77..67cc9b08ae9d 100644 --- a/lib/zlib_inflate/inflate.c +++ 
b/lib/zlib_inflate/inflate.c @@ -15,6 +15,16 @@ #include "inffast.h" #include "infutil.h" +/* architecture-specific bits */ +#ifdef CONFIG_ZLIB_DFLTCC +# include "../zlib_dfltcc/dfltcc.h" +#else +#define INFLATE_RESET_HOOK(strm) do {} while (0) +#define INFLATE_TYPEDO_HOOK(strm, flush) do {} while (0) +#define INFLATE_NEED_UPDATEWINDOW(strm) 1 +#define INFLATE_NEED_CHECKSUM(strm) 1 +#endif + int zlib_inflate_workspacesize(void) { return sizeof(struct inflate_workspace); @@ -42,6 +52,7 @@ int zlib_inflateReset(z_streamp strm) state->write = 0; state->whave = 0; + INFLATE_RESET_HOOK(strm); return Z_OK; } @@ -66,7 +77,15 @@ int zlib_inflateInit2(z_streamp strm, int windowBits) return Z_STREAM_ERROR; } state->wbits = (unsigned)windowBits; +#ifdef CONFIG_ZLIB_DFLTCC + /* + * DFLTCC requires the window to be page aligned. + * Thus, we overallocate and take the aligned portion of the buffer. + */ + state->window = PTR_ALIGN(&WS(strm)->working_window[0], PAGE_SIZE); +#else state->window = &WS(strm)->working_window[0]; +#endif return zlib_inflateReset(strm); } @@ -227,11 +246,6 @@ static int zlib_inflateSyncPacket(z_streamp strm) bits -= bits & 7; \ } while (0) -/* Reverse the bytes in a 32-bit value */ -#define REVERSE(q) \ - ((((q) >> 24) & 0xff) + (((q) >> 8) & 0xff00) + \ - (((q) & 0xff00) << 8) + (((q) & 0xff) << 24)) - /* inflate() uses a state machine to process as much input data and generate as much output data as possible before returning. 
The state machine is @@ -395,6 +409,7 @@ int zlib_inflate(z_streamp strm, int flush) if (flush == Z_BLOCK) goto inf_leave; /* fall through */ case TYPEDO: + INFLATE_TYPEDO_HOOK(strm, flush); if (state->last) { BYTEBITS(); state->mode = CHECK; @@ -692,7 +707,7 @@ int zlib_inflate(z_streamp strm, int flush) out -= left; strm->total_out += out; state->total += out; - if (out) + if (INFLATE_NEED_CHECKSUM(strm) && out) strm->adler = state->check = UPDATE(state->check, put - out, out); out = left; @@ -726,7 +741,8 @@ int zlib_inflate(z_streamp strm, int flush) */ inf_leave: RESTORE(); - if (state->wsize || (state->mode < CHECK && out != strm->avail_out)) + if (INFLATE_NEED_UPDATEWINDOW(strm) && + (state->wsize || (state->mode < CHECK && out != strm->avail_out))) zlib_updatewindow(strm, out); in -= strm->avail_in; @@ -734,7 +750,7 @@ int zlib_inflate(z_streamp strm, int flush) strm->total_in += in; strm->total_out += out; state->total += out; - if (state->wrap && out) + if (INFLATE_NEED_CHECKSUM(strm) && state->wrap && out) strm->adler = state->check = UPDATE(state->check, strm->next_out - out, out); diff --git a/lib/zlib_inflate/inflate.h b/lib/zlib_inflate/inflate.h index 3d17b3d1b21f..f79337ddf98c 100644 --- a/lib/zlib_inflate/inflate.h +++ b/lib/zlib_inflate/inflate.h @@ -11,6 +11,8 @@ subject to change. Applications should only use zlib.h. 
*/ +#include "inftrees.h" + /* Possible inflate modes between inflate() calls */ typedef enum { HEAD, /* i: waiting for magic header */ @@ -108,4 +110,10 @@ struct inflate_state { unsigned short work[288]; /* work area for code table building */ code codes[ENOUGH]; /* space for code tables */ }; + +/* Reverse the bytes in a 32-bit value */ +#define REVERSE(q) \ + ((((q) >> 24) & 0xff) + (((q) >> 8) & 0xff00) + \ + (((q) & 0xff00) << 8) + (((q) & 0xff) << 24)) + #endif diff --git a/lib/zlib_inflate/infutil.h b/lib/zlib_inflate/infutil.h index eb1a9007bd86..784ab33b7842 100644 --- a/lib/zlib_inflate/infutil.h +++ b/lib/zlib_inflate/infutil.h @@ -12,14 +12,28 @@ #define _INFUTIL_H #include <linux/zlib.h> +#ifdef CONFIG_ZLIB_DFLTCC +#include "../zlib_dfltcc/dfltcc.h" +#include <asm/page.h> +#endif /* memory allocation for inflation */ struct inflate_workspace { struct inflate_state inflate_state; - unsigned char working_window[1 << MAX_WBITS]; +#ifdef CONFIG_ZLIB_DFLTCC + struct dfltcc_state dfltcc_state; + unsigned char working_window[(1 << MAX_WBITS) + PAGE_SIZE]; +#else + unsigned char working_window[(1 << MAX_WBITS)]; +#endif }; -#define WS(z) ((struct inflate_workspace *)(z->workspace)) +#ifdef CONFIG_ZLIB_DFLTCC +/* dfltcc_state must be doubleword aligned for DFLTCC call */ +static_assert(offsetof(struct inflate_workspace, dfltcc_state) % 8 == 0); +#endif + +#define WS(strm) ((struct inflate_workspace *)(strm->workspace)) #endif diff --git a/mm/Makefile b/mm/Makefile index 1937cc251883..32f08e22e824 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -20,6 +20,7 @@ KCOV_INSTRUMENT_kmemleak.o := n KCOV_INSTRUMENT_memcontrol.o := n KCOV_INSTRUMENT_mmzone.o := n KCOV_INSTRUMENT_vmstat.o := n +KCOV_INSTRUMENT_failslab.o := n CFLAGS_init-mm.o += $(call cc-disable-warning, override-init) CFLAGS_init-mm.o += $(call cc-disable-warning, initializer-overrides) diff --git a/mm/backing-dev.c b/mm/backing-dev.c index c360f6a6c844..62f05f605fb5 100644 --- a/mm/backing-dev.c +++ 
b/mm/backing-dev.c @@ -21,6 +21,7 @@ struct backing_dev_info noop_backing_dev_info = { EXPORT_SYMBOL_GPL(noop_backing_dev_info); static struct class *bdi_class; +const char *bdi_unknown_name = "(unknown)"; /* * bdi_lock protects bdi_tree and updates to bdi_list. bdi_list has RCU diff --git a/mm/debug.c b/mm/debug.c index 74ee73cf7079..ecccd9f17801 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -46,7 +46,15 @@ void __dump_page(struct page *page, const char *reason) { struct address_space *mapping; bool page_poisoned = PagePoisoned(page); + /* + * Accessing the pageblock without the zone lock. It could change to + * "isolate" again in the meantime, but since we are just dumping the + * state for debugging, it should be fine to accept a bit of + * inaccuracy here due to racing. + */ + bool page_cma = is_migrate_cma_page(page); int mapcount; + char *type = ""; /* * If struct page is poisoned don't access Page*() functions as that @@ -78,9 +86,9 @@ void __dump_page(struct page *page, const char *reason) page, page_ref_count(page), mapcount, page->mapping, page_to_pgoff(page)); if (PageKsm(page)) - pr_warn("ksm flags: %#lx(%pGp)\n", page->flags, &page->flags); + type = "ksm "; else if (PageAnon(page)) - pr_warn("anon flags: %#lx(%pGp)\n", page->flags, &page->flags); + type = "anon "; else if (mapping) { if (mapping->host && mapping->host->i_dentry.first) { struct dentry *dentry; @@ -88,10 +96,12 @@ void __dump_page(struct page *page, const char *reason) pr_warn("%ps name:\"%pd\"\n", mapping->a_ops, dentry); } else pr_warn("%ps\n", mapping->a_ops); - pr_warn("flags: %#lx(%pGp)\n", page->flags, &page->flags); } BUILD_BUG_ON(ARRAY_SIZE(pageflag_names) != __NR_PAGEFLAGS + 1); + pr_warn("%sflags: %#lx(%pGp)%s\n", type, page->flags, &page->flags, + page_cma ? 
" CMA" : ""); + hex_only: print_hex_dump(KERN_WARNING, "raw: ", DUMP_PREFIX_NONE, 32, sizeof(unsigned long), page, diff --git a/mm/early_ioremap.c b/mm/early_ioremap.c index 1826f191e72c..a0018ad1a1f6 100644 --- a/mm/early_ioremap.c +++ b/mm/early_ioremap.c @@ -121,8 +121,8 @@ __early_ioremap(resource_size_t phys_addr, unsigned long size, pgprot_t prot) } } - if (WARN(slot < 0, "%s(%08llx, %08lx) not found slot\n", - __func__, (u64)phys_addr, size)) + if (WARN(slot < 0, "%s(%pa, %08lx) not found slot\n", + __func__, &phys_addr, size)) return NULL; /* Don't allow wraparound or zero size */ @@ -158,8 +158,8 @@ __early_ioremap(resource_size_t phys_addr, unsigned long size, pgprot_t prot) --idx; --nrpages; } - WARN(early_ioremap_debug, "%s(%08llx, %08lx) [%d] => %08lx + %08lx\n", - __func__, (u64)phys_addr, size, slot, offset, slot_virt[slot]); + WARN(early_ioremap_debug, "%s(%pa, %08lx) [%d] => %08lx + %08lx\n", + __func__, &phys_addr, size, slot, offset, slot_virt[slot]); prev_map[slot] = (void __iomem *)(offset + slot_virt[slot]); return prev_map[slot]; diff --git a/mm/filemap.c b/mm/filemap.c index bf6aa30be58d..1784478270e1 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -632,33 +632,6 @@ static bool mapping_needs_writeback(struct address_space *mapping) return mapping->nrpages; } -int filemap_write_and_wait(struct address_space *mapping) -{ - int err = 0; - - if (mapping_needs_writeback(mapping)) { - err = filemap_fdatawrite(mapping); - /* - * Even if the above returned error, the pages may be - * written partially (e.g. -ENOSPC), so we wait for it. - * But the -EIO is special case, it may indicate the worst - * thing (e.g. bug) happened, so we avoid waiting for it. 
- */ - if (err != -EIO) { - int err2 = filemap_fdatawait(mapping); - if (!err) - err = err2; - } else { - /* Clear any previously stored errors */ - filemap_check_errors(mapping); - } - } else { - err = filemap_check_errors(mapping); - } - return err; -} -EXPORT_SYMBOL(filemap_write_and_wait); - /** * filemap_write_and_wait_range - write out & wait on a file range * @mapping: the address_space for the pages @@ -680,7 +653,12 @@ int filemap_write_and_wait_range(struct address_space *mapping, if (mapping_needs_writeback(mapping)) { err = __filemap_fdatawrite_range(mapping, lstart, lend, WB_SYNC_ALL); - /* See comment of filemap_write_and_wait() */ + /* + * Even if the above returned error, the pages may be + * written partially (e.g. -ENOSPC), so we wait for it. + * But the -EIO is special case, it may indicate the worst + * thing (e.g. bug) happened, so we avoid waiting for it. + */ if (err != -EIO) { int err2 = filemap_fdatawait_range(mapping, lstart, lend); @@ -29,8 +29,23 @@ struct follow_page_context { unsigned int page_mask; }; +/* + * Return the compound head page with ref appropriately incremented, + * or NULL if that failed. + */ +static inline struct page *try_get_compound_head(struct page *page, int refs) +{ + struct page *head = compound_head(page); + + if (WARN_ON_ONCE(page_ref_count(head) < 0)) + return NULL; + if (unlikely(!page_cache_add_speculative(head, refs))) + return NULL; + return head; +} + /** - * put_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages + * unpin_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages * @pages: array of pages to be maybe marked dirty, and definitely released. * @npages: number of pages in the @pages array. * @make_dirty: whether to mark the pages dirty @@ -40,19 +55,19 @@ struct follow_page_context { * * For each page in the @pages array, make that page (or its head page, if a * compound page) dirty, if @make_dirty is true, and if the page was previously - * listed as clean. 
In any case, releases all pages using put_user_page(), - * possibly via put_user_pages(), for the non-dirty case. + * listed as clean. In any case, releases all pages using unpin_user_page(), + * possibly via unpin_user_pages(), for the non-dirty case. * - * Please see the put_user_page() documentation for details. + * Please see the unpin_user_page() documentation for details. * * set_page_dirty_lock() is used internally. If instead, set_page_dirty() is * required, then the caller should a) verify that this is really correct, * because _lock() is usually required, and b) hand code it: - * set_page_dirty_lock(), put_user_page(). + * set_page_dirty_lock(), unpin_user_page(). * */ -void put_user_pages_dirty_lock(struct page **pages, unsigned long npages, - bool make_dirty) +void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages, + bool make_dirty) { unsigned long index; @@ -63,7 +78,7 @@ void put_user_pages_dirty_lock(struct page **pages, unsigned long npages, */ if (!make_dirty) { - put_user_pages(pages, npages); + unpin_user_pages(pages, npages); return; } @@ -91,21 +106,21 @@ void put_user_pages_dirty_lock(struct page **pages, unsigned long npages, */ if (!PageDirty(page)) set_page_dirty_lock(page); - put_user_page(page); + unpin_user_page(page); } } -EXPORT_SYMBOL(put_user_pages_dirty_lock); +EXPORT_SYMBOL(unpin_user_pages_dirty_lock); /** - * put_user_pages() - release an array of gup-pinned pages. + * unpin_user_pages() - release an array of gup-pinned pages. * @pages: array of pages to be marked dirty and released. * @npages: number of pages in the @pages array. * - * For each page in the @pages array, release the page using put_user_page(). + * For each page in the @pages array, release the page using unpin_user_page(). * - * Please see the put_user_page() documentation for details. + * Please see the unpin_user_page() documentation for details. 
*/ -void put_user_pages(struct page **pages, unsigned long npages) +void unpin_user_pages(struct page **pages, unsigned long npages) { unsigned long index; @@ -115,9 +130,9 @@ void put_user_pages(struct page **pages, unsigned long npages) * single operation to the head page should suffice. */ for (index = 0; index < npages; index++) - put_user_page(pages[index]); + unpin_user_page(pages[index]); } -EXPORT_SYMBOL(put_user_pages); +EXPORT_SYMBOL(unpin_user_pages); #ifdef CONFIG_MMU static struct page *no_page_table(struct vm_area_struct *vma, @@ -179,6 +194,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, spinlock_t *ptl; pte_t *ptep, pte; + /* FOLL_GET and FOLL_PIN are mutually exclusive. */ + if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) == + (FOLL_PIN | FOLL_GET))) + return ERR_PTR(-EINVAL); retry: if (unlikely(pmd_bad(*pmd))) return no_page_table(vma, flags); @@ -323,7 +342,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma, pmdval = READ_ONCE(*pmd); if (pmd_none(pmdval)) return no_page_table(vma, flags); - if (pmd_huge(pmdval) && vma->vm_flags & VM_HUGETLB) { + if (pmd_huge(pmdval) && is_vm_hugetlb_page(vma)) { page = follow_huge_pmd(mm, address, pmd, flags); if (page) return page; @@ -433,7 +452,7 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma, pud = pud_offset(p4dp, address); if (pud_none(*pud)) return no_page_table(vma, flags); - if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB) { + if (pud_huge(*pud) && is_vm_hugetlb_page(vma)) { page = follow_huge_pud(mm, address, pud, flags); if (page) return page; @@ -796,7 +815,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, start = untagged_addr(start); - VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET)); + VM_BUG_ON(!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN))); /* * If FOLL_FORCE is set then do not force a full fault as the hinting @@ -1020,7 +1039,16 @@ static __always_inline long __get_user_pages_locked(struct task_struct 
*tsk, BUG_ON(*locked != 1); } - if (pages) + /* + * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior + * is to set FOLL_GET if the caller wants pages[] filled in (but has + * carelessly failed to specify FOLL_GET), so keep doing that, but only + * for FOLL_GET, not for the newer FOLL_PIN. + * + * FOLL_PIN always expects pages to be non-null, but no need to assert + * that here, as any failures will be obvious enough. + */ + if (pages && !(flags & FOLL_PIN)) flags |= FOLL_GET; pages_done = 0; @@ -1096,88 +1124,6 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk, return pages_done; } -/* - * get_user_pages_remote() - pin user pages in memory - * @tsk: the task_struct to use for page fault accounting, or - * NULL if faults are not to be recorded. - * @mm: mm_struct of target mm - * @start: starting user address - * @nr_pages: number of pages from start to pin - * @gup_flags: flags modifying lookup behaviour - * @pages: array that receives pointers to the pages pinned. - * Should be at least nr_pages long. Or NULL, if caller - * only intends to ensure the pages are faulted in. - * @vmas: array of pointers to vmas corresponding to each page. - * Or NULL if the caller does not require them. - * @locked: pointer to lock flag indicating whether lock is held and - * subsequently whether VM_FAULT_RETRY functionality can be - * utilised. Lock must initially be held. - * - * Returns either number of pages pinned (which may be less than the - * number requested), or an error. Details about the return value: - * - * -- If nr_pages is 0, returns 0. - * -- If nr_pages is >0, but no pages were pinned, returns -errno. - * -- If nr_pages is >0, and some pages were pinned, returns the number of - * pages pinned. Again, this may be less than nr_pages. - * - * The caller is responsible for releasing returned @pages, via put_page(). - * - * @vmas are valid only as long as mmap_sem is held. 
- * - * Must be called with mmap_sem held for read or write. - * - * get_user_pages walks a process's page tables and takes a reference to - * each struct page that each user address corresponds to at a given - * instant. That is, it takes the page that would be accessed if a user - * thread accesses the given user virtual address at that instant. - * - * This does not guarantee that the page exists in the user mappings when - * get_user_pages returns, and there may even be a completely different - * page there in some cases (eg. if mmapped pagecache has been invalidated - * and subsequently re faulted). However it does guarantee that the page - * won't be freed completely. And mostly callers simply care that the page - * contains data that was valid *at some point in time*. Typically, an IO - * or similar operation cannot guarantee anything stronger anyway because - * locks can't be held over the syscall boundary. - * - * If gup_flags & FOLL_WRITE == 0, the page must not be written to. If the page - * is written to, set_page_dirty (or set_page_dirty_lock, as appropriate) must - * be called after the page is finished with, and before put_page is called. - * - * get_user_pages is typically used for fewer-copy IO operations, to get a - * handle on the memory by some means other than accesses via the user virtual - * addresses. The pages may be submitted for DMA to devices or accessed via - * their kernel linear mapping (via the kmap APIs). Care should be taken to - * use the correct cache flushing APIs. - * - * See also get_user_pages_fast, for performance critical applications. - * - * get_user_pages should be phased out in favor of - * get_user_pages_locked|unlocked or get_user_pages_fast. Nothing - * should use get_user_pages because it cannot pass - * FAULT_FLAG_ALLOW_RETRY to handle_mm_fault. 
- */ -long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, - unsigned long start, unsigned long nr_pages, - unsigned int gup_flags, struct page **pages, - struct vm_area_struct **vmas, int *locked) -{ - /* - * FIXME: Current FOLL_LONGTERM behavior is incompatible with - * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on - * vmas. As there are no users of this flag in this call we simply - * disallow this option for now. - */ - if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM)) - return -EINVAL; - - return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas, - locked, - gup_flags | FOLL_TOUCH | FOLL_REMOTE); -} -EXPORT_SYMBOL(get_user_pages_remote); - /** * populate_vma_page_range() - populate a range of pages in the vma. * @vma: target vma @@ -1612,6 +1558,116 @@ static __always_inline long __gup_longterm_locked(struct task_struct *tsk, #endif /* CONFIG_FS_DAX || CONFIG_CMA */ /* + * get_user_pages_remote() - pin user pages in memory + * @tsk: the task_struct to use for page fault accounting, or + * NULL if faults are not to be recorded. + * @mm: mm_struct of target mm + * @start: starting user address + * @nr_pages: number of pages from start to pin + * @gup_flags: flags modifying lookup behaviour + * @pages: array that receives pointers to the pages pinned. + * Should be at least nr_pages long. Or NULL, if caller + * only intends to ensure the pages are faulted in. + * @vmas: array of pointers to vmas corresponding to each page. + * Or NULL if the caller does not require them. + * @locked: pointer to lock flag indicating whether lock is held and + * subsequently whether VM_FAULT_RETRY functionality can be + * utilised. Lock must initially be held. + * + * Returns either number of pages pinned (which may be less than the + * number requested), or an error. Details about the return value: + * + * -- If nr_pages is 0, returns 0. + * -- If nr_pages is >0, but no pages were pinned, returns -errno. 
+ * -- If nr_pages is >0, and some pages were pinned, returns the number of + * pages pinned. Again, this may be less than nr_pages. + * + * The caller is responsible for releasing returned @pages, via put_page(). + * + * @vmas are valid only as long as mmap_sem is held. + * + * Must be called with mmap_sem held for read or write. + * + * get_user_pages walks a process's page tables and takes a reference to + * each struct page that each user address corresponds to at a given + * instant. That is, it takes the page that would be accessed if a user + * thread accesses the given user virtual address at that instant. + * + * This does not guarantee that the page exists in the user mappings when + * get_user_pages returns, and there may even be a completely different + * page there in some cases (eg. if mmapped pagecache has been invalidated + * and subsequently re faulted). However it does guarantee that the page + * won't be freed completely. And mostly callers simply care that the page + * contains data that was valid *at some point in time*. Typically, an IO + * or similar operation cannot guarantee anything stronger anyway because + * locks can't be held over the syscall boundary. + * + * If gup_flags & FOLL_WRITE == 0, the page must not be written to. If the page + * is written to, set_page_dirty (or set_page_dirty_lock, as appropriate) must + * be called after the page is finished with, and before put_page is called. + * + * get_user_pages is typically used for fewer-copy IO operations, to get a + * handle on the memory by some means other than accesses via the user virtual + * addresses. The pages may be submitted for DMA to devices or accessed via + * their kernel linear mapping (via the kmap APIs). Care should be taken to + * use the correct cache flushing APIs. + * + * See also get_user_pages_fast, for performance critical applications. + * + * get_user_pages should be phased out in favor of + * get_user_pages_locked|unlocked or get_user_pages_fast. 
Nothing + * should use get_user_pages because it cannot pass + * FAULT_FLAG_ALLOW_RETRY to handle_mm_fault. + */ +#ifdef CONFIG_MMU +long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, unsigned long nr_pages, + unsigned int gup_flags, struct page **pages, + struct vm_area_struct **vmas, int *locked) +{ + /* + * FOLL_PIN must only be set internally by the pin_user_pages*() APIs, + * never directly by the caller, so enforce that with an assertion: + */ + if (WARN_ON_ONCE(gup_flags & FOLL_PIN)) + return -EINVAL; + + /* + * Parts of FOLL_LONGTERM behavior are incompatible with + * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on + * vmas. However, this only comes up if locked is set, and there are + * callers that do request FOLL_LONGTERM, but do not set locked. So, + * allow what we can. + */ + if (gup_flags & FOLL_LONGTERM) { + if (WARN_ON_ONCE(locked)) + return -EINVAL; + /* + * This will check the vmas (even if our vmas arg is NULL) + * and return -ENOTSUPP if DAX isn't allowed in this case: + */ + return __gup_longterm_locked(tsk, mm, start, nr_pages, pages, + vmas, gup_flags | FOLL_TOUCH | + FOLL_REMOTE); + } + + return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas, + locked, + gup_flags | FOLL_TOUCH | FOLL_REMOTE); +} +EXPORT_SYMBOL(get_user_pages_remote); + +#else /* CONFIG_MMU */ +long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, unsigned long nr_pages, + unsigned int gup_flags, struct page **pages, + struct vm_area_struct **vmas, int *locked) +{ + return 0; +} +#endif /* !CONFIG_MMU */ + +/* * This is the same as get_user_pages_remote(), just with a * less-flexible calling convention where we assume that the task * and mm being operated on are the current task's and don't allow @@ -1622,6 +1678,13 @@ long get_user_pages(unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, struct vm_area_struct 
**vmas) { + /* + * FOLL_PIN must only be set internally by the pin_user_pages*() APIs, + * never directly by the caller, so enforce that with an assertion: + */ + if (WARN_ON_ONCE(gup_flags & FOLL_PIN)) + return -EINVAL; + return __gup_longterm_locked(current, current->mm, start, nr_pages, pages, vmas, gup_flags | FOLL_TOUCH); } @@ -1807,20 +1870,6 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start, } } -/* - * Return the compund head page with ref appropriately incremented, - * or NULL if that failed. - */ -static inline struct page *try_get_compound_head(struct page *page, int refs) -{ - struct page *head = compound_head(page); - if (WARN_ON_ONCE(page_ref_count(head) < 0)) - return NULL; - if (unlikely(!page_cache_add_speculative(head, refs))) - return NULL; - return head; -} - #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) @@ -1978,6 +2027,29 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr, } #endif +static int record_subpages(struct page *page, unsigned long addr, + unsigned long end, struct page **pages) +{ + int nr; + + for (nr = 0; addr != end; addr += PAGE_SIZE) + pages[nr++] = page++; + + return nr; +} + +static void put_compound_head(struct page *page, int refs) +{ + VM_BUG_ON_PAGE(page_ref_count(page) < refs, page); + /* + * Calling put_page() for each ref is unnecessarily slow. Only the last + * ref needs a put_page(). 
+	 */
+	if (refs > 1)
+		page_ref_sub(page, refs - 1);
+	put_page(page);
+}
+
 #ifdef CONFIG_ARCH_HAS_HUGEPD
 static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
 				      unsigned long sz)
@@ -2007,32 +2079,20 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 	/* hugepages are never "special" */
 	VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 
-	refs = 0;
 	head = pte_page(pte);
-
 	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
-	do {
-		VM_BUG_ON(compound_head(page) != head);
-		pages[*nr] = page;
-		(*nr)++;
-		page++;
-		refs++;
-	} while (addr += PAGE_SIZE, addr != end);
+	refs = record_subpages(page, addr, end, pages + *nr);
 
 	head = try_get_compound_head(head, refs);
-	if (!head) {
-		*nr -= refs;
+	if (!head)
 		return 0;
-	}
 
 	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
-		/* Could be optimized better */
-		*nr -= refs;
-		while (refs--)
-			put_page(head);
+		put_compound_head(head, refs);
 		return 0;
 	}
 
+	*nr += refs;
 	SetPageReferenced(head);
 	return 1;
 }
@@ -2079,28 +2139,19 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
 	}
 
-	refs = 0;
 	page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	do {
-		pages[*nr] = page;
-		(*nr)++;
-		page++;
-		refs++;
-	} while (addr += PAGE_SIZE, addr != end);
+	refs = record_subpages(page, addr, end, pages + *nr);
 
 	head = try_get_compound_head(pmd_page(orig), refs);
-	if (!head) {
-		*nr -= refs;
+	if (!head)
 		return 0;
-	}
 
 	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
-		*nr -= refs;
-		while (refs--)
-			put_page(head);
+		put_compound_head(head, refs);
 		return 0;
 	}
 
+	*nr += refs;
 	SetPageReferenced(head);
 	return 1;
 }
@@ -2120,28 +2171,19 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr);
 	}
 
-	refs = 0;
 	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-	do {
-		pages[*nr] = page;
-		(*nr)++;
-		page++;
-		refs++;
-	} while (addr += PAGE_SIZE, addr != end);
+	refs = record_subpages(page, addr, end, pages + *nr);
 
 	head = try_get_compound_head(pud_page(orig), refs);
-	if (!head) {
-		*nr -= refs;
+	if (!head)
 		return 0;
-	}
 
 	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
-		*nr -= refs;
-		while (refs--)
-			put_page(head);
+		put_compound_head(head, refs);
 		return 0;
 	}
 
+	*nr += refs;
 	SetPageReferenced(head);
 	return 1;
 }
@@ -2157,28 +2199,20 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 		return 0;
 
 	BUILD_BUG_ON(pgd_devmap(orig));
-	refs = 0;
+
 	page = pgd_page(orig) + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT);
-	do {
-		pages[*nr] = page;
-		(*nr)++;
-		page++;
-		refs++;
-	} while (addr += PAGE_SIZE, addr != end);
+	refs = record_subpages(page, addr, end, pages + *nr);
 
 	head = try_get_compound_head(pgd_page(orig), refs);
-	if (!head) {
-		*nr -= refs;
+	if (!head)
 		return 0;
-	}
 
 	if (unlikely(pgd_val(orig) != pgd_val(*pgdp))) {
-		*nr -= refs;
-		while (refs--)
-			put_page(head);
+		put_compound_head(head, refs);
 		return 0;
 	}
 
+	*nr += refs;
 	SetPageReferenced(head);
 	return 1;
 }
@@ -2237,7 +2271,7 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
 		pud_t pud = READ_ONCE(*pudp);
 
 		next = pud_addr_end(addr, end);
-		if (pud_none(pud))
+		if (unlikely(!pud_present(pud)))
 			return 0;
 		if (unlikely(pud_huge(pud))) {
 			if (!gup_huge_pud(pud, pudp, addr, next, flags,
@@ -2393,29 +2427,15 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
 	return ret;
 }
 
-/**
- * get_user_pages_fast() - pin user pages in memory
- * @start:	starting user address
- * @nr_pages:	number of pages from start to pin
- * @gup_flags:	flags modifying pin behaviour
- * @pages:	array that receives pointers to the pages pinned.
- *		Should be at least nr_pages long.
- *
- * Attempt to pin user pages in memory without taking mm->mmap_sem.
- * If not successful, it will fall back to taking the lock and
- * calling get_user_pages().
- *
- * Returns number of pages pinned. This may be fewer than the number
- * requested. If nr_pages is 0 or negative, returns 0. If no pages
- * were pinned, returns -errno.
- */
-int get_user_pages_fast(unsigned long start, int nr_pages,
-			unsigned int gup_flags, struct page **pages)
+static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
+					unsigned int gup_flags,
+					struct page **pages)
 {
 	unsigned long addr, len, end;
 	int nr = 0, ret = 0;
 
-	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
+	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
+				       FOLL_FORCE | FOLL_PIN)))
 		return -EINVAL;
 
 	start = untagged_addr(start) & PAGE_MASK;
@@ -2455,4 +2475,103 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
 
 	return ret;
 }
+
+/**
+ * get_user_pages_fast() - pin user pages in memory
+ * @start:	starting user address
+ * @nr_pages:	number of pages from start to pin
+ * @gup_flags:	flags modifying pin behaviour
+ * @pages:	array that receives pointers to the pages pinned.
+ *		Should be at least nr_pages long.
+ *
+ * Attempt to pin user pages in memory without taking mm->mmap_sem.
+ * If not successful, it will fall back to taking the lock and
+ * calling get_user_pages().
+ *
+ * Returns number of pages pinned. This may be fewer than the number requested.
+ * If nr_pages is 0 or negative, returns 0. If no pages were pinned, returns
+ * -errno.
+ */
+int get_user_pages_fast(unsigned long start, int nr_pages,
+			unsigned int gup_flags, struct page **pages)
+{
+	/*
+	 * FOLL_PIN must only be set internally by the pin_user_pages*() APIs,
+	 * never directly by the caller, so enforce that:
+	 */
+	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
+		return -EINVAL;
+
+	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
+}
 EXPORT_SYMBOL_GPL(get_user_pages_fast);
+
+/**
+ * pin_user_pages_fast() - pin user pages in memory without taking locks
+ *
+ * For now, this is a placeholder function, until various call sites are
+ * converted to use the correct get_user_pages*() or pin_user_pages*() API. So,
+ * this is identical to get_user_pages_fast().
+ *
+ * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
+ * is NOT intended for Case 2 (RDMA: long-term pins).
+ */
+int pin_user_pages_fast(unsigned long start, int nr_pages,
+			unsigned int gup_flags, struct page **pages)
+{
+	/*
+	 * This is a placeholder, until the pin functionality is activated.
+	 * Until then, just behave like the corresponding get_user_pages*()
+	 * routine.
+	 */
+	return get_user_pages_fast(start, nr_pages, gup_flags, pages);
+}
+EXPORT_SYMBOL_GPL(pin_user_pages_fast);
+
+/**
+ * pin_user_pages_remote() - pin pages of a remote process (task != current)
+ *
+ * For now, this is a placeholder function, until various call sites are
+ * converted to use the correct get_user_pages*() or pin_user_pages*() API. So,
+ * this is identical to get_user_pages_remote().
+ *
+ * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
+ * is NOT intended for Case 2 (RDMA: long-term pins).
+ */
+long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
+			   unsigned long start, unsigned long nr_pages,
+			   unsigned int gup_flags, struct page **pages,
+			   struct vm_area_struct **vmas, int *locked)
+{
+	/*
+	 * This is a placeholder, until the pin functionality is activated.
+	 * Until then, just behave like the corresponding get_user_pages*()
+	 * routine.
+	 */
+	return get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags, pages,
+				     vmas, locked);
+}
+EXPORT_SYMBOL(pin_user_pages_remote);
+
+/**
+ * pin_user_pages() - pin user pages in memory for use by other devices
+ *
+ * For now, this is a placeholder function, until various call sites are
+ * converted to use the correct get_user_pages*() or pin_user_pages*() API. So,
+ * this is identical to get_user_pages().
+ *
+ * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
+ * is NOT intended for Case 2 (RDMA: long-term pins).
+ */
+long pin_user_pages(unsigned long start, unsigned long nr_pages,
+		    unsigned int gup_flags, struct page **pages,
+		    struct vm_area_struct **vmas)
+{
+	/*
+	 * This is a placeholder, until the pin functionality is activated.
+	 * Until then, just behave like the corresponding get_user_pages*()
+	 * routine.
+	 */
+	return get_user_pages(start, nr_pages, gup_flags, pages, vmas);
+}
+EXPORT_SYMBOL(pin_user_pages);
diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
index ad9d5b1c4473..8dba38e79a9f 100644
--- a/mm/gup_benchmark.c
+++ b/mm/gup_benchmark.c
@@ -49,18 +49,21 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
 			nr = (next - addr) / PAGE_SIZE;
 		}
 
+		/* Filter out most gup flags: only allow a tiny subset here: */
+		gup->flags &= FOLL_WRITE;
+
 		switch (cmd) {
 		case GUP_FAST_BENCHMARK:
-			nr = get_user_pages_fast(addr, nr, gup->flags & 1,
+			nr = get_user_pages_fast(addr, nr, gup->flags,
 						 pages + i);
 			break;
 		case GUP_LONGTERM_BENCHMARK:
 			nr = get_user_pages(addr, nr,
-					    (gup->flags & 1) | FOLL_LONGTERM,
+					    gup->flags | FOLL_LONGTERM,
 					    pages + i, NULL);
 			break;
 		case GUP_BENCHMARK:
-			nr = get_user_pages(addr, nr, gup->flags & 1, pages + i,
+			nr = get_user_pages(addr, nr, gup->flags, pages + i,
 					    NULL);
 			break;
 		default:
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f39689a29128..b08b199f9a11 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -177,16 +177,13 @@ static ssize_t enabled_store(struct kobject *kobj,
 {
 	ssize_t ret = count;
 
-	if (!memcmp("always", buf,
-		    min(sizeof("always")-1, count))) {
+	if (sysfs_streq(buf, "always")) {
 		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
 		set_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
-	} else if (!memcmp("madvise", buf,
-			   min(sizeof("madvise")-1, count))) {
+	} else if (sysfs_streq(buf, "madvise")) {
 		clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
 		set_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
-	} else if (!memcmp("never", buf,
-			   min(sizeof("never")-1, count))) {
+	} else if (sysfs_streq(buf, "never")) {
 		clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
 		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
 	} else
@@ -250,32 +247,27 @@ static ssize_t defrag_store(struct kobject *kobj,
 			    struct kobj_attribute *attr,
 			    const char *buf, size_t count)
 {
-	if (!memcmp("always", buf,
-		    min(sizeof("always")-1, count))) {
+	if (sysfs_streq(buf, "always")) {
 		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags);
 		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags);
 		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags);
 		set_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags);
-	} else if (!memcmp("defer+madvise", buf,
-			   min(sizeof("defer+madvise")-1, count))) {
+	} else if (sysfs_streq(buf, "defer+madvise")) {
 		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags);
 		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags);
 		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags);
 		set_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags);
-	} else if (!memcmp("defer", buf,
-			   min(sizeof("defer")-1, count))) {
+	} else if (sysfs_streq(buf, "defer")) {
 		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags);
 		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags);
 		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags);
 		set_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags);
-	} else if (!memcmp("madvise", buf,
-			   min(sizeof("madvise")-1, count))) {
+	} else if (sysfs_streq(buf, "madvise")) {
 		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags);
 		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags);
 		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags);
 		set_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags);
-	} else if (!memcmp("never", buf,
-			   min(sizeof("never")-1, count))) {
+	} else if (sysfs_streq(buf, "never")) {
 		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags);
 		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags);
 		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags);
@@ -2715,7 +2707,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 {
 	struct page *head = compound_head(page);
 	struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
-	struct deferred_split *ds_queue = get_deferred_split_queue(page);
+	struct deferred_split *ds_queue = get_deferred_split_queue(head);
 	struct anon_vma *anon_vma = NULL;
 	struct address_space *mapping = NULL;
 	int count, mapcount, extra_pins, ret;
@@ -2723,11 +2715,11 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	unsigned long flags;
 	pgoff_t end;
 
-	VM_BUG_ON_PAGE(is_huge_zero_page(page), page);
-	VM_BUG_ON_PAGE(!PageLocked(page), page);
-	VM_BUG_ON_PAGE(!PageCompound(page), page);
+	VM_BUG_ON_PAGE(is_huge_zero_page(head), head);
+	VM_BUG_ON_PAGE(!PageLocked(head), head);
+	VM_BUG_ON_PAGE(!PageCompound(head), head);
 
-	if (PageWriteback(page))
+	if (PageWriteback(head))
 		return -EBUSY;
 
 	if (PageAnon(head)) {
@@ -2778,7 +2770,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		goto out_unlock;
 	}
 
-	mlocked = PageMlocked(page);
+	mlocked = PageMlocked(head);
 	unmap_page(head);
 	VM_BUG_ON_PAGE(compound_mapcount(head), head);
 
@@ -2810,14 +2802,14 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 			ds_queue->split_queue_len--;
 			list_del(page_deferred_list(head));
 		}
+		spin_unlock(&ds_queue->split_queue_lock);
 		if (mapping) {
-			if (PageSwapBacked(page))
-				__dec_node_page_state(page, NR_SHMEM_THPS);
+			if (PageSwapBacked(head))
+				__dec_node_page_state(head, NR_SHMEM_THPS);
 			else
-				__dec_node_page_state(page, NR_FILE_THPS);
+				__dec_node_page_state(head, NR_FILE_THPS);
 		}
 
-		spin_unlock(&ds_queue->split_queue_lock);
 		__split_huge_page(page, list, end, flags);
 		if (PageSwapCache(head)) {
 			swp_entry_t entry = { .val = page_private(head) };
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 244607663363..3a4259eeb5a0 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -13,7 +13,7 @@
  *
  * The following locks and mutexes are used by kmemleak:
  *
- * - kmemleak_lock (rwlock): protects the object_list modifications and
+ * - kmemleak_lock (raw_spinlock_t): protects the object_list modifications and
  *   accesses to the object_tree_root. The object_list is the main list
  *   holding the metadata (struct kmemleak_object) for the allocated memory
  *   blocks. The object_tree_root is a red black tree used to look-up
@@ -22,13 +22,13 @@
  *   object_tree_root in the create_object() function called from the
  *   kmemleak_alloc() callback and removed in delete_object() called from the
  *   kmemleak_free() callback
- * - kmemleak_object.lock (spinlock): protects a kmemleak_object. Accesses to
- *   the metadata (e.g. count) are protected by this lock. Note that some
- *   members of this structure may be protected by other means (atomic or
- *   kmemleak_lock). This lock is also held when scanning the corresponding
- *   memory block to avoid the kernel freeing it via the kmemleak_free()
- *   callback. This is less heavyweight than holding a global lock like
- *   kmemleak_lock during scanning
+ * - kmemleak_object.lock (raw_spinlock_t): protects a kmemleak_object.
+ *   Accesses to the metadata (e.g. count) are protected by this lock. Note
+ *   that some members of this structure may be protected by other means
+ *   (atomic or kmemleak_lock). This lock is also held when scanning the
+ *   corresponding memory block to avoid the kernel freeing it via the
+ *   kmemleak_free() callback. This is less heavyweight than holding a global
+ *   lock like kmemleak_lock during scanning.
  * - scan_mutex (mutex): ensures that only one thread may scan the memory for
  *   unreferenced objects at a time. The gray_list contains the objects which
  *   are already referenced or marked as false positives and need to be
@@ -135,7 +135,7 @@ struct kmemleak_scan_area {
  * (use_count) and freed using the RCU mechanism.
  */
 struct kmemleak_object {
-	spinlock_t lock;
+	raw_spinlock_t lock;
 	unsigned int flags;		/* object status flags */
 	struct list_head object_list;
 	struct list_head gray_list;
@@ -191,8 +191,8 @@ static int mem_pool_free_count = ARRAY_SIZE(mem_pool);
 static LIST_HEAD(mem_pool_free_list);
 /* search tree for object boundaries */
 static struct rb_root object_tree_root = RB_ROOT;
-/* rw_lock protecting the access to object_list and object_tree_root */
-static DEFINE_RWLOCK(kmemleak_lock);
+/* protecting the access to object_list and object_tree_root */
+static DEFINE_RAW_SPINLOCK(kmemleak_lock);
 
 /* allocation caches for kmemleak internal data */
 static struct kmem_cache *object_cache;
@@ -426,7 +426,7 @@ static struct kmemleak_object *mem_pool_alloc(gfp_t gfp)
 	}
 
 	/* slab allocation failed, try the memory pool */
-	write_lock_irqsave(&kmemleak_lock, flags);
+	raw_spin_lock_irqsave(&kmemleak_lock, flags);
 	object = list_first_entry_or_null(&mem_pool_free_list,
 					  typeof(*object), object_list);
 	if (object)
@@ -435,7 +435,7 @@ static struct kmemleak_object *mem_pool_alloc(gfp_t gfp)
 		object = &mem_pool[--mem_pool_free_count];
 	else
 		pr_warn_once("Memory pool empty, consider increasing CONFIG_DEBUG_KMEMLEAK_MEM_POOL_SIZE\n");
-	write_unlock_irqrestore(&kmemleak_lock, flags);
+	raw_spin_unlock_irqrestore(&kmemleak_lock, flags);
 
 	return object;
 }
@@ -453,9 +453,9 @@ static void mem_pool_free(struct kmemleak_object *object)
 	}
 
 	/* add the object to the memory pool free list */
-	write_lock_irqsave(&kmemleak_lock, flags);
+	raw_spin_lock_irqsave(&kmemleak_lock, flags);
 	list_add(&object->object_list, &mem_pool_free_list);
-	write_unlock_irqrestore(&kmemleak_lock, flags);
+	raw_spin_unlock_irqrestore(&kmemleak_lock, flags);
 }
 
 /*
@@ -514,9 +514,9 @@ static struct kmemleak_object *find_and_get_object(unsigned long ptr, int alias)
 	struct kmemleak_object *object;
 
 	rcu_read_lock();
-	read_lock_irqsave(&kmemleak_lock, flags);
+	raw_spin_lock_irqsave(&kmemleak_lock, flags);
 	object = lookup_object(ptr, alias);
-	read_unlock_irqrestore(&kmemleak_lock, flags);
+	raw_spin_unlock_irqrestore(&kmemleak_lock, flags);
 
 	/* check whether the object is still available */
 	if (object && !get_object(object))
@@ -546,11 +546,11 @@ static struct kmemleak_object *find_and_remove_object(unsigned long ptr, int ali
 	unsigned long flags;
 	struct kmemleak_object *object;
 
-	write_lock_irqsave(&kmemleak_lock, flags);
+	raw_spin_lock_irqsave(&kmemleak_lock, flags);
 	object = lookup_object(ptr, alias);
 	if (object)
 		__remove_object(object);
-	write_unlock_irqrestore(&kmemleak_lock, flags);
+	raw_spin_unlock_irqrestore(&kmemleak_lock, flags);
 
 	return object;
 }
@@ -585,7 +585,7 @@ static struct kmemleak_object *create_object(unsigned long ptr, size_t size,
 	INIT_LIST_HEAD(&object->object_list);
 	INIT_LIST_HEAD(&object->gray_list);
 	INIT_HLIST_HEAD(&object->area_list);
-	spin_lock_init(&object->lock);
+	raw_spin_lock_init(&object->lock);
 	atomic_set(&object->use_count, 1);
 	object->flags = OBJECT_ALLOCATED;
 	object->pointer = ptr;
@@ -617,7 +617,7 @@ static struct kmemleak_object *create_object(unsigned long ptr, size_t size,
 	/* kernel backtrace */
 	object->trace_len = __save_stack_trace(object->trace);
 
-	write_lock_irqsave(&kmemleak_lock, flags);
+	raw_spin_lock_irqsave(&kmemleak_lock, flags);
 
 	untagged_ptr = (unsigned long)kasan_reset_tag((void *)ptr);
 	min_addr = min(min_addr, untagged_ptr);
@@ -649,7 +649,7 @@ static struct kmemleak_object *create_object(unsigned long ptr, size_t size,
 
 	list_add_tail_rcu(&object->object_list, &object_list);
 out:
-	write_unlock_irqrestore(&kmemleak_lock, flags);
+	raw_spin_unlock_irqrestore(&kmemleak_lock, flags);
 	return object;
 }
 
@@ -667,9 +667,9 @@ static void __delete_object(struct kmemleak_object *object)
 	 * Locking here also ensures that the corresponding memory block
 	 * cannot be freed when it is being scanned.
 	 */
-	spin_lock_irqsave(&object->lock, flags);
+	raw_spin_lock_irqsave(&object->lock, flags);
 	object->flags &= ~OBJECT_ALLOCATED;
-	spin_unlock_irqrestore(&object->lock, flags);
+	raw_spin_unlock_irqrestore(&object->lock, flags);
 	put_object(object);
 }
 
@@ -739,9 +739,9 @@ static void paint_it(struct kmemleak_object *object, int color)
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&object->lock, flags);
+	raw_spin_lock_irqsave(&object->lock, flags);
 	__paint_it(object, color);
-	spin_unlock_irqrestore(&object->lock, flags);
+	raw_spin_unlock_irqrestore(&object->lock, flags);
 }
 
 static void paint_ptr(unsigned long ptr, int color)
@@ -798,7 +798,7 @@ static void add_scan_area(unsigned long ptr, size_t size, gfp_t gfp)
 	if (scan_area_cache)
 		area = kmem_cache_alloc(scan_area_cache, gfp_kmemleak_mask(gfp));
 
-	spin_lock_irqsave(&object->lock, flags);
+	raw_spin_lock_irqsave(&object->lock, flags);
 	if (!area) {
 		pr_warn_once("Cannot allocate a scan area, scanning the full object\n");
 		/* mark the object for full scan to avoid false positives */
@@ -820,7 +820,7 @@ static void add_scan_area(unsigned long ptr, size_t size, gfp_t gfp)
 
 	hlist_add_head(&area->node, &object->area_list);
 out_unlock:
-	spin_unlock_irqrestore(&object->lock, flags);
+	raw_spin_unlock_irqrestore(&object->lock, flags);
 	put_object(object);
 }
 
@@ -842,9 +842,9 @@ static void object_set_excess_ref(unsigned long ptr, unsigned long excess_ref)
 		return;
 	}
 
-	spin_lock_irqsave(&object->lock, flags);
+	raw_spin_lock_irqsave(&object->lock, flags);
 	object->excess_ref = excess_ref;
-	spin_unlock_irqrestore(&object->lock, flags);
+	raw_spin_unlock_irqrestore(&object->lock, flags);
 	put_object(object);
 }
 
@@ -864,9 +864,9 @@ static void object_no_scan(unsigned long ptr)
 		return;
 	}
 
-	spin_lock_irqsave(&object->lock, flags);
+	raw_spin_lock_irqsave(&object->lock, flags);
 	object->flags |= OBJECT_NO_SCAN;
-	spin_unlock_irqrestore(&object->lock, flags);
+	raw_spin_unlock_irqrestore(&object->lock, flags);
 	put_object(object);
 }
 
@@ -1026,9 +1026,9 @@ void __ref kmemleak_update_trace(const void *ptr)
 		return;
 	}
 
-	spin_lock_irqsave(&object->lock, flags);
+	raw_spin_lock_irqsave(&object->lock, flags);
 	object->trace_len = __save_stack_trace(object->trace);
-	spin_unlock_irqrestore(&object->lock, flags);
+	raw_spin_unlock_irqrestore(&object->lock, flags);
 
 	put_object(object);
 }
@@ -1233,7 +1233,7 @@ static void scan_block(void *_start, void *_end,
 	unsigned long flags;
 	unsigned long untagged_ptr;
 
-	read_lock_irqsave(&kmemleak_lock, flags);
+	raw_spin_lock_irqsave(&kmemleak_lock, flags);
 	for (ptr = start; ptr < end; ptr++) {
 		struct kmemleak_object *object;
 		unsigned long pointer;
@@ -1268,7 +1268,7 @@ static void scan_block(void *_start, void *_end,
 		 * previously acquired in scan_object(). These locks are
 		 * enclosed by scan_mutex.
 		 */
-		spin_lock_nested(&object->lock, SINGLE_DEPTH_NESTING);
+		raw_spin_lock_nested(&object->lock, SINGLE_DEPTH_NESTING);
 		/* only pass surplus references (object already gray) */
 		if (color_gray(object)) {
 			excess_ref = object->excess_ref;
@@ -1277,7 +1277,7 @@ static void scan_block(void *_start, void *_end,
 			excess_ref = 0;
 			update_refs(object);
 		}
-		spin_unlock(&object->lock);
+		raw_spin_unlock(&object->lock);
 
 		if (excess_ref) {
 			object = lookup_object(excess_ref, 0);
@@ -1286,12 +1286,12 @@ static void scan_block(void *_start, void *_end,
 			if (object == scanned)
 				/* circular reference, ignore */
 				continue;
-			spin_lock_nested(&object->lock, SINGLE_DEPTH_NESTING);
+			raw_spin_lock_nested(&object->lock, SINGLE_DEPTH_NESTING);
 			update_refs(object);
-			spin_unlock(&object->lock);
+			raw_spin_unlock(&object->lock);
 		}
 	}
-	read_unlock_irqrestore(&kmemleak_lock, flags);
+	raw_spin_unlock_irqrestore(&kmemleak_lock, flags);
 }
 
 /*
@@ -1324,7 +1324,7 @@ static void scan_object(struct kmemleak_object *object)
 	 * Once the object->lock is acquired, the corresponding memory block
 	 * cannot be freed (the same lock is acquired in delete_object).
 	 */
-	spin_lock_irqsave(&object->lock, flags);
+	raw_spin_lock_irqsave(&object->lock, flags);
 	if (object->flags & OBJECT_NO_SCAN)
 		goto out;
 	if (!(object->flags & OBJECT_ALLOCATED))
@@ -1344,9 +1344,9 @@ static void scan_object(struct kmemleak_object *object)
 			if (start >= end)
 				break;
 
-			spin_unlock_irqrestore(&object->lock, flags);
+			raw_spin_unlock_irqrestore(&object->lock, flags);
 			cond_resched();
-			spin_lock_irqsave(&object->lock, flags);
+			raw_spin_lock_irqsave(&object->lock, flags);
 		} while (object->flags & OBJECT_ALLOCATED);
 	} else
 		hlist_for_each_entry(area, &object->area_list, node)
@@ -1354,7 +1354,7 @@ static void scan_object(struct kmemleak_object *object)
 				   (void *)(area->start + area->size),
 				   object);
 out:
-	spin_unlock_irqrestore(&object->lock, flags);
+	raw_spin_unlock_irqrestore(&object->lock, flags);
 }
 
 /*
@@ -1407,7 +1407,7 @@ static void kmemleak_scan(void)
 	/* prepare the kmemleak_object's */
 	rcu_read_lock();
 	list_for_each_entry_rcu(object, &object_list, object_list) {
-		spin_lock_irqsave(&object->lock, flags);
+		raw_spin_lock_irqsave(&object->lock, flags);
 #ifdef DEBUG
 		/*
 		 * With a few exceptions there should be a maximum of
@@ -1424,7 +1424,7 @@ static void kmemleak_scan(void)
 		if (color_gray(object) && get_object(object))
 			list_add_tail(&object->gray_list, &gray_list);
 
-		spin_unlock_irqrestore(&object->lock, flags);
+		raw_spin_unlock_irqrestore(&object->lock, flags);
 	}
 	rcu_read_unlock();
 
@@ -1492,14 +1492,14 @@ static void kmemleak_scan(void)
 	 */
 	rcu_read_lock();
 	list_for_each_entry_rcu(object, &object_list, object_list) {
-		spin_lock_irqsave(&object->lock, flags);
+		raw_spin_lock_irqsave(&object->lock, flags);
 		if (color_white(object) && (object->flags & OBJECT_ALLOCATED)
 		    && update_checksum(object) && get_object(object)) {
 			/* color it gray temporarily */
 			object->count = object->min_count;
 			list_add_tail(&object->gray_list, &gray_list);
 		}
-		spin_unlock_irqrestore(&object->lock, flags);
+		raw_spin_unlock_irqrestore(&object->lock, flags);
 	}
 	rcu_read_unlock();
 
@@ -1519,7 +1519,7 @@ static void kmemleak_scan(void)
 	 */
 	rcu_read_lock();
 	list_for_each_entry_rcu(object, &object_list, object_list) {
-		spin_lock_irqsave(&object->lock, flags);
+		raw_spin_lock_irqsave(&object->lock, flags);
 		if (unreferenced_object(object) &&
 		    !(object->flags & OBJECT_REPORTED)) {
 			object->flags |= OBJECT_REPORTED;
@@ -1529,7 +1529,7 @@ static void kmemleak_scan(void)
 
 			new_leaks++;
 		}
-		spin_unlock_irqrestore(&object->lock, flags);
+		raw_spin_unlock_irqrestore(&object->lock, flags);
 	}
 	rcu_read_unlock();
 
@@ -1681,10 +1681,10 @@ static int kmemleak_seq_show(struct seq_file *seq, void *v)
 	struct kmemleak_object *object = v;
 	unsigned long flags;
 
-	spin_lock_irqsave(&object->lock, flags);
+	raw_spin_lock_irqsave(&object->lock, flags);
 	if ((object->flags & OBJECT_REPORTED) && unreferenced_object(object))
 		print_unreferenced(seq, object);
-	spin_unlock_irqrestore(&object->lock, flags);
+	raw_spin_unlock_irqrestore(&object->lock, flags);
 	return 0;
 }
 
@@ -1714,9 +1714,9 @@ static int dump_str_object_info(const char *str)
 		return -EINVAL;
 	}
 
-	spin_lock_irqsave(&object->lock, flags);
+	raw_spin_lock_irqsave(&object->lock, flags);
 	dump_object_info(object);
-	spin_unlock_irqrestore(&object->lock, flags);
+	raw_spin_unlock_irqrestore(&object->lock, flags);
 
 	put_object(object);
 	return 0;
@@ -1735,11 +1735,11 @@ static void kmemleak_clear(void)
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(object, &object_list, object_list) {
-		spin_lock_irqsave(&object->lock, flags);
+		raw_spin_lock_irqsave(&object->lock, flags);
 		if ((object->flags & OBJECT_REPORTED) &&
 		    unreferenced_object(object))
 			__paint_it(object, KMEMLEAK_GREY);
-		spin_unlock_irqrestore(&object->lock, flags);
+		raw_spin_unlock_irqrestore(&object->lock, flags);
 	}
 	rcu_read_unlock();
 
diff --git a/mm/memblock.c b/mm/memblock.c
index 4bc2c7d8bf42..eba94ee3de0b 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -575,7 +575,7 @@ static void __init_memblock memblock_insert_region(struct memblock_type *type,
  * Return:
  * 0 on success, -errno on failure.
  */
-int __init_memblock memblock_add_range(struct memblock_type *type,
+static int __init_memblock memblock_add_range(struct memblock_type *type,
 				phys_addr_t base, phys_addr_t size,
 				int nid, enum memblock_flags flags)
 {
@@ -694,7 +694,7 @@ int __init_memblock memblock_add(phys_addr_t base, phys_addr_t size)
 {
 	phys_addr_t end = base + size - 1;
 
-	memblock_dbg("memblock_add: [%pa-%pa] %pS\n",
+	memblock_dbg("%s: [%pa-%pa] %pS\n", __func__,
 		     &base, &end, (void *)_RET_IP_);
 
 	return memblock_add_range(&memblock.memory, base, size, MAX_NUMNODES, 0);
@@ -795,7 +795,7 @@ int __init_memblock memblock_remove(phys_addr_t base, phys_addr_t size)
 {
 	phys_addr_t end = base + size - 1;
 
-	memblock_dbg("memblock_remove: [%pa-%pa] %pS\n",
+	memblock_dbg("%s: [%pa-%pa] %pS\n", __func__,
 		     &base, &end, (void *)_RET_IP_);
 
 	return memblock_remove_range(&memblock.memory, base, size);
@@ -813,7 +813,7 @@ int __init_memblock memblock_free(phys_addr_t base, phys_addr_t size)
 {
 	phys_addr_t end = base + size - 1;
 
-	memblock_dbg(" memblock_free: [%pa-%pa] %pS\n",
+	memblock_dbg("%s: [%pa-%pa] %pS\n", __func__,
 		     &base, &end, (void *)_RET_IP_);
 
 	kmemleak_free_part_phys(base, size);
@@ -824,12 +824,24 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
 {
 	phys_addr_t end = base + size - 1;
 
-	memblock_dbg("memblock_reserve: [%pa-%pa] %pS\n",
+	memblock_dbg("%s: [%pa-%pa] %pS\n", __func__,
 		     &base, &end, (void *)_RET_IP_);
 
 	return memblock_add_range(&memblock.reserved, base, size, MAX_NUMNODES, 0);
 }
 
+#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
+int __init_memblock memblock_physmem_add(phys_addr_t base, phys_addr_t size)
+{
+	phys_addr_t end = base + size - 1;
+
+	memblock_dbg("%s: [%pa-%pa] %pS\n", __func__,
+		     &base, &end, (void *)_RET_IP_);
+
+	return memblock_add_range(&memblock.physmem, base, size, MAX_NUMNODES, 0);
+}
+#endif
+
 /**
  * memblock_setclr_flag - set or clear flag for a memory region
  * @base: base address of the region
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6c83cf4ed970..6f6dc8712e39 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5340,14 +5340,6 @@ static int mem_cgroup_move_account(struct page *page,
 		__mod_lruvec_state(to_vec, NR_WRITEBACK, nr_pages);
 	}
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	if (compound && !list_empty(page_deferred_list(page))) {
-		spin_lock(&from->deferred_split_queue.split_queue_lock);
-		list_del_init(page_deferred_list(page));
-		from->deferred_split_queue.split_queue_len--;
-		spin_unlock(&from->deferred_split_queue.split_queue_lock);
-	}
-#endif
 	/*
 	 * It is safe to change page->mem_cgroup here because the page
 	 * is referenced, charged, and isolated - we can't race with
@@ -5357,16 +5349,6 @@ static int mem_cgroup_move_account(struct page *page,
 	/* caller should have done css_get */
 	page->mem_cgroup = to;
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	if (compound && list_empty(page_deferred_list(page))) {
-		spin_lock(&to->deferred_split_queue.split_queue_lock);
-		list_add_tail(page_deferred_list(page),
-			      &to->deferred_split_queue.split_queue);
-		to->deferred_split_queue.split_queue_len++;
-		spin_unlock(&to->deferred_split_queue.split_queue_lock);
-	}
-#endif
-
 	spin_unlock_irqrestore(&from->move_lock, flags);
 
 	ret = 0;
@@ -6651,7 +6633,6 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
 {
 	struct mem_cgroup *memcg;
 	unsigned int nr_pages;
-	bool compound;
 	unsigned long flags;
 
 	VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage);
@@ -6673,8 +6654,7 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
 		return;
 
 	/* Force-charge the new page. The old one will be freed soon */
-	compound = PageTransHuge(newpage);
-	nr_pages = compound ? hpage_nr_pages(newpage) : 1;
+	nr_pages = hpage_nr_pages(newpage);
 
 	page_counter_charge(&memcg->memory, nr_pages);
 	if (do_memsw_account())
@@ -6684,7 +6664,8 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
 	commit_charge(newpage, memcg, false);
 
 	local_irq_save(flags);
-	mem_cgroup_charge_statistics(memcg, newpage, compound, nr_pages);
+	mem_cgroup_charge_statistics(memcg, newpage, PageTransHuge(newpage),
+			nr_pages);
 	memcg_check_events(memcg, newpage);
 	local_irq_restore(flags);
 }
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index a91a072f2b2c..36d80915ddc2 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -783,27 +783,18 @@ struct zone * zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
 	return default_zone_for_pfn(nid, start_pfn, nr_pages);
 }
 
-int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_type)
+int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
+		       int online_type, int nid)
 {
 	unsigned long flags;
 	unsigned long onlined_pages = 0;
 	struct zone *zone;
 	int need_zonelists_rebuild = 0;
-	int nid;
 	int ret;
 	struct memory_notify arg;
-	struct memory_block *mem;
 
 	mem_hotplug_begin();
 
-	/*
-	 * We can't use pfn_to_nid() because nid might be stored in struct page
-	 * which is not yet initialized. Instead, we find nid from memory block.
-	 */
-	mem = find_memory_block(__pfn_to_section(pfn));
-	nid = mem->nid;
-	put_device(&mem->dev);
-
 	/* associate pfn range with the zone */
 	zone = zone_for_pfn_range(online_type, nid, pfn, nr_pages);
 	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL);
@@ -1182,7 +1173,7 @@ static bool is_pageblock_removable_nolock(unsigned long pfn)
 	if (!zone_spans_pfn(zone, pfn))
 		return false;
 
-	return !has_unmovable_pages(zone, page, 0, MIGRATE_MOVABLE,
+	return !has_unmovable_pages(zone, page, MIGRATE_MOVABLE,
 				    MEMORY_OFFLINE);
 }
 
@@ -1764,8 +1755,6 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
 
 	BUG_ON(check_hotplug_memory_range(start, size));
 
-	mem_hotplug_begin();
-
 	/*
 	 * All memory blocks must be offlined before removing memory. Check
 	 * whether all memory blocks in question are offline and return error
@@ -1778,9 +1767,14 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
 	/* remove memmap entry */
 	firmware_map_remove(start, start + size, "System RAM");
 
-	/* remove memory block devices before removing memory */
+	/*
+	 * Memory block device removal under the device_hotplug_lock is
+	 * a barrier against racing online attempts.
+	 */
 	remove_memory_block_devices(start, size);
 
+	mem_hotplug_begin();
+
 	arch_remove_memory(nid, start, size, NULL);
 	memblock_free(start, size);
 	memblock_remove(start, size);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b2920ae87a61..977c641f78cf 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2821,6 +2821,9 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 	char *flags = strchr(str, '=');
 	int err = 1, mode;
 
+	if (flags)
+		*flags++ = '\0';	/* terminate mode string */
+
 	if (nodelist) {
 		/* NUL-terminate mode or flags string */
 		*nodelist++ = '\0';
@@ -2831,9 +2834,6 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 	} else
 		nodes_clear(nodes);
 
-	if (flags)
-		*flags++ = '\0';	/* terminate mode string */
-
 	mode = match_string(policy_modes, MPOL_MAX, str);
 	if (mode < 0)
 		goto out;
diff --git a/mm/memremap.c b/mm/memremap.c
index c51c6bd2fe34..4c723d2049d5 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -27,7 +27,8 @@ static void devmap_managed_enable_put(void)
 
 static int devmap_managed_enable_get(struct dev_pagemap *pgmap)
 {
-	if (!pgmap->ops || !pgmap->ops->page_free) {
+	if (pgmap->type == MEMORY_DEVICE_PRIVATE &&
+	    (!pgmap->ops || !pgmap->ops->page_free)) {
 		WARN(1, "Missing page_free method\n");
 		return -EINVAL;
 	}
@@ -410,48 +411,42 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 EXPORT_SYMBOL_GPL(get_dev_pagemap);
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void __put_devmap_managed_page(struct page *page)
+void free_devmap_managed_page(struct page *page)
 {
-	int count = page_ref_dec_return(page);
-
-	/*
-	 * If refcount is 1 then page is freed and refcount is stable as nobody
-	 * holds a reference on the page.
-	 */
-	if (count == 1) {
-		/* Clear Active bit in case of parallel mark_page_accessed */
-		__ClearPageActive(page);
-		__ClearPageWaiters(page);
+	/* notify page idle for dax */
+	if (!is_device_private_page(page)) {
+		wake_up_var(&page->_refcount);
+		return;
+	}
 
-		mem_cgroup_uncharge(page);
+	/* Clear Active bit in case of parallel mark_page_accessed */
+	__ClearPageActive(page);
+	__ClearPageWaiters(page);
 
-		/*
-		 * When a device_private page is freed, the page->mapping field
-		 * may still contain a (stale) mapping value. For example, the
-		 * lower bits of page->mapping may still identify the page as
-		 * an anonymous page. Ultimately, this entire field is just
-		 * stale and wrong, and it will cause errors if not cleared.
-		 * One example is:
-		 *
-		 *  migrate_vma_pages()
-		 *    migrate_vma_insert_page()
-		 *      page_add_new_anon_rmap()
-		 *        __page_set_anon_rmap()
-		 *          ...checks page->mapping, via PageAnon(page) call,
-		 *            and incorrectly concludes that the page is an
-		 *            anonymous page. Therefore, it incorrectly,
-		 *            silently fails to set up the new anon rmap.
-		 *
-		 * For other types of ZONE_DEVICE pages, migration is either
-		 * handled differently or not done at all, so there is no need
-		 * to clear page->mapping.
-		 */
-		if (is_device_private_page(page))
-			page->mapping = NULL;
+	mem_cgroup_uncharge(page);
 
-		page->pgmap->ops->page_free(page);
-	} else if (!count)
-		__put_page(page);
+	/*
+	 * When a device_private page is freed, the page->mapping field
+	 * may still contain a (stale) mapping value. For example, the
+	 * lower bits of page->mapping may still identify the page as an
+	 * anonymous page. Ultimately, this entire field is just stale
+	 * and wrong, and it will cause errors if not cleared. One
+	 * example is:
+	 *
+	 *  migrate_vma_pages()
+	 *    migrate_vma_insert_page()
+	 *      page_add_new_anon_rmap()
+	 *        __page_set_anon_rmap()
+	 *          ...checks page->mapping, via PageAnon(page) call,
+	 *            and incorrectly concludes that the page is an
+	 *            anonymous page. Therefore, it incorrectly,
+	 *            silently fails to set up the new anon rmap.
+	 *
+	 * For other types of ZONE_DEVICE pages, migration is either
+	 * handled differently or not done at all, so there is no need
+	 * to clear page->mapping.
+	 */
+	page->mapping = NULL;
+	page->pgmap->ops->page_free(page);
 }
-EXPORT_SYMBOL(__put_devmap_managed_page);
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/migrate.c b/mm/migrate.c
index 86873b6f38a7..edf42ed90030 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -48,6 +48,7 @@
 #include <linux/page_owner.h>
 #include <linux/sched/mm.h>
 #include <linux/ptrace.h>
+#include <linux/oom.h>
 
 #include <asm/tlbflush.h>
 
@@ -986,7 +987,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 		}
 
 		/*
-		 * Anonymous and movable page->mapping will be cleard by
+		 * Anonymous and movable page->mapping will be cleared by
 		 * free_pages_prepare so don't reset it here for keeping
 		 * the type to work PageAnon, for example.
 		 */
@@ -1199,8 +1200,7 @@ out:
 		/*
 		 * A page that has been migrated has all references
 		 * removed and will be freed. A page that has not been
-		 * migrated will have kepts its references and be
-		 * restored.
+		 * migrated will have kept its references and be restored.
 		 */
 		list_del(&page->lru);
 
@@ -1627,8 +1627,19 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
 			start = i;
 		} else if (node != current_node) {
 			err = do_move_pages_to_node(mm, &pagelist, current_node);
-			if (err)
+			if (err) {
+				/*
+				 * Positive err means the number of failed
+				 * pages to migrate. Since we are going to
+				 * abort and return the number of non-migrated
+				 * pages, so need to incude the rest of the
+				 * nr_pages that have not been attempted as
+				 * well.
+ */ + if (err > 0) + err += nr_pages - i - 1; goto out; + } err = store_status(status, start, current_node, i - start); if (err) goto out; @@ -1659,8 +1670,11 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes, goto out_flush; err = do_move_pages_to_node(mm, &pagelist, current_node); - if (err) + if (err) { + if (err > 0) + err += nr_pages - i - 1; goto out; + } if (i > start) { err = store_status(status, start, current_node, i - start); if (err) @@ -1674,9 +1688,16 @@ out_flush: /* Make sure we do not overwrite the existing error */ err1 = do_move_pages_to_node(mm, &pagelist, current_node); + /* + * Don't have to report non-attempted pages here since: + * - If the above loop is done gracefully all pages have been + * attempted. + * - If the above loop is aborted it means a fatal error + * happened, should return ret. + */ if (!err1) err1 = store_status(status, start, current_node, i - start); - if (!err) + if (err >= 0) err = err1; out: return err; @@ -2135,7 +2156,7 @@ static int migrate_vma_collect_hole(unsigned long start, struct migrate_vma *migrate = walk->private; unsigned long addr; - for (addr = start & PAGE_MASK; addr < end; addr += PAGE_SIZE) { + for (addr = start; addr < end; addr += PAGE_SIZE) { migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE; migrate->dst[migrate->npages] = 0; migrate->npages++; @@ -2152,7 +2173,7 @@ static int migrate_vma_collect_skip(unsigned long start, struct migrate_vma *migrate = walk->private; unsigned long addr; - for (addr = start & PAGE_MASK; addr < end; addr += PAGE_SIZE) { + for (addr = start; addr < end; addr += PAGE_SIZE) { migrate->dst[migrate->npages] = 0; migrate->src[migrate->npages++] = 0; } @@ -2675,6 +2696,14 @@ int migrate_vma_setup(struct migrate_vma *args) } EXPORT_SYMBOL(migrate_vma_setup); +/* + * This code closely matches the code in: + * __handle_mm_fault() + * handle_pte_fault() + * do_anonymous_page() + * to map in an anonymous zero page but the struct page will be a ZONE_DEVICE 
+ * private page. + */ static void migrate_vma_insert_page(struct migrate_vma *migrate, unsigned long addr, struct page *page, @@ -2755,30 +2784,24 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate, ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl); + if (check_stable_address_space(mm)) + goto unlock_abort; + if (pte_present(*ptep)) { unsigned long pfn = pte_pfn(*ptep); - if (!is_zero_pfn(pfn)) { - pte_unmap_unlock(ptep, ptl); - mem_cgroup_cancel_charge(page, memcg, false); - goto abort; - } + if (!is_zero_pfn(pfn)) + goto unlock_abort; flush = true; - } else if (!pte_none(*ptep)) { - pte_unmap_unlock(ptep, ptl); - mem_cgroup_cancel_charge(page, memcg, false); - goto abort; - } + } else if (!pte_none(*ptep)) + goto unlock_abort; /* - * Check for usefaultfd but do not deliver the fault. Instead, + * Check for userfaultfd but do not deliver the fault. Instead, * just back off. */ - if (userfaultfd_missing(vma)) { - pte_unmap_unlock(ptep, ptl); - mem_cgroup_cancel_charge(page, memcg, false); - goto abort; - } + if (userfaultfd_missing(vma)) + goto unlock_abort; inc_mm_counter(mm, MM_ANONPAGES); page_add_new_anon_rmap(page, vma, addr, false); @@ -2802,6 +2825,9 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate, *src = MIGRATE_PFN_MIGRATE; return; +unlock_abort: + pte_unmap_unlock(ptep, ptl); + mem_cgroup_cancel_charge(page, memcg, false); abort: *src &= ~MIGRATE_PFN_MIGRATE; } @@ -2834,9 +2860,8 @@ void migrate_vma_pages(struct migrate_vma *migrate) } if (!page) { - if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE)) { + if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE)) continue; - } if (!notified) { notified = true; diff --git a/mm/mmap.c b/mm/mmap.c index bc788548c4e5..6756b8bb0033 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1270,26 +1270,22 @@ static struct anon_vma *reusable_anon_vma(struct vm_area_struct *old, struct vm_ */ struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma) { - struct anon_vma *anon_vma; - struct 
vm_area_struct *near; - - near = vma->vm_next; - if (!near) - goto try_prev; - - anon_vma = reusable_anon_vma(near, vma, near); - if (anon_vma) - return anon_vma; -try_prev: - near = vma->vm_prev; - if (!near) - goto none; - - anon_vma = reusable_anon_vma(near, near, vma); - if (anon_vma) - return anon_vma; -none: + struct anon_vma *anon_vma = NULL; + + /* Try next first. */ + if (vma->vm_next) { + anon_vma = reusable_anon_vma(vma->vm_next, vma, vma->vm_next); + if (anon_vma) + return anon_vma; + } + + /* Try prev next. */ + if (vma->vm_prev) + anon_vma = reusable_anon_vma(vma->vm_prev, vma->vm_prev, vma); + /* + * We might reach here with anon_vma == NULL if we can't find + * any reusable anon_vma. * There's no absolute need to look only at touching neighbours: * we could search further afield for "compatible" anon_vmas. * But it would probably just be a waste of time searching, @@ -1297,7 +1293,7 @@ none: * We're trying to allow mprotect remerging later on, * not trying to minimize memory used for anon_vmas. 
*/ - return NULL; + return anon_vma; } /* diff --git a/mm/oom_kill.c b/mm/oom_kill.c index d58c481b3df8..dfc357614e56 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -26,6 +26,7 @@ #include <linux/sched/mm.h> #include <linux/sched/coredump.h> #include <linux/sched/task.h> +#include <linux/sched/debug.h> #include <linux/swap.h> #include <linux/timex.h> #include <linux/jiffies.h> @@ -620,6 +621,7 @@ static void oom_reap_task(struct task_struct *tsk) pr_info("oom_reaper: unable to reap pid:%d (%s)\n", task_pid_nr(tsk), tsk->comm); + sched_show_task(tsk); debug_show_all_locks(); done: diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d047bf7d8fd4..15e908ad933b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5848,6 +5848,30 @@ overlap_memmap_init(unsigned long zone, unsigned long *pfn) return false; } +#ifdef CONFIG_SPARSEMEM +/* Skip PFNs that belong to non-present sections */ +static inline __meminit unsigned long next_pfn(unsigned long pfn) +{ + unsigned long section_nr; + + section_nr = pfn_to_section_nr(++pfn); + if (present_section_nr(section_nr)) + return pfn; + + while (++section_nr <= __highest_present_section_nr) { + if (present_section_nr(section_nr)) + return section_nr_to_pfn(section_nr); + } + + return -1; +} +#else +static inline __meminit unsigned long next_pfn(unsigned long pfn) +{ + return pfn++; +} +#endif + /* * Initially all pages are reserved - free ones are freed * up by memblock_free_all() once the early boot process is @@ -5887,8 +5911,10 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, * function. They do not exist on hotplugged memory. 
*/ if (context == MEMMAP_EARLY) { - if (!early_pfn_valid(pfn)) + if (!early_pfn_valid(pfn)) { + pfn = next_pfn(pfn) - 1; continue; + } if (!early_pfn_in_nid(pfn, nid)) continue; if (overlap_memmap_init(zone, &pfn)) @@ -8154,20 +8180,22 @@ void *__init alloc_large_system_hash(const char *tablename, /* * This function checks whether pageblock includes unmovable pages or not. - * If @count is not zero, it is okay to include less @count unmovable pages * * PageLRU check without isolation or lru_lock could race so that * MIGRATE_MOVABLE block might include unmovable pages. And __PageMovable * check without lock_page also may miss some movable non-lru pages at * race condition. So you can't expect this function should be exact. + * + * Returns a page without holding a reference. If the caller wants to + * dereference that page (e.g., dumping), it has to make sure that that it + * cannot get removed (e.g., via memory unplug) concurrently. + * */ -bool has_unmovable_pages(struct zone *zone, struct page *page, int count, - int migratetype, int flags) +struct page *has_unmovable_pages(struct zone *zone, struct page *page, + int migratetype, int flags) { - unsigned long found; unsigned long iter = 0; unsigned long pfn = page_to_pfn(page); - const char *reason = "unmovable page"; /* * TODO we could make this much more efficient by not checking every @@ -8184,22 +8212,19 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count, * so consider them movable here. 
*/ if (is_migrate_cma(migratetype)) - return false; + return NULL; - reason = "CMA page"; - goto unmovable; + return page; } - for (found = 0; iter < pageblock_nr_pages; iter++) { - unsigned long check = pfn + iter; - - if (!pfn_valid_within(check)) + for (; iter < pageblock_nr_pages; iter++) { + if (!pfn_valid_within(pfn + iter)) continue; - page = pfn_to_page(check); + page = pfn_to_page(pfn + iter); if (PageReserved(page)) - goto unmovable; + return page; /* * If the zone is movable and we have ruled out all reserved @@ -8219,7 +8244,7 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count, unsigned int skip_pages; if (!hugepage_migration_supported(page_hstate(head))) - goto unmovable; + return page; skip_pages = compound_nr(head) - (page - head); iter += skip_pages - 1; @@ -8245,11 +8270,9 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count, if ((flags & MEMORY_OFFLINE) && PageHWPoison(page)) continue; - if (__PageMovable(page)) + if (__PageMovable(page) || PageLRU(page)) continue; - if (!PageLRU(page)) - found++; /* * If there are RECLAIMABLE pages, we need to check * it. But now, memory offline itself doesn't call @@ -8263,15 +8286,9 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count, * is set to both of a memory hole page and a _used_ kernel * page at boot. 
*/ - if (found > count) - goto unmovable; + return page; } - return false; -unmovable: - WARN_ON_ONCE(zone_idx(zone) == ZONE_MOVABLE); - if (flags & REPORT_FAILURE) - dump_page(pfn_to_page(pfn + iter), reason); - return true; + return NULL; } #ifdef CONFIG_CONTIG_ALLOC @@ -8675,10 +8692,6 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn) BUG_ON(!PageBuddy(page)); order = page_order(page); offlined_pages += 1 << order; -#ifdef CONFIG_DEBUG_VM - pr_info("remove from free list %lx %d %lx\n", - pfn, 1 << order, end_pfn); -#endif del_page_from_free_area(page, &zone->free_area[order]); pfn += (1 << order); } diff --git a/mm/page_isolation.c b/mm/page_isolation.c index 04ee1663cdbe..a9fd7c740c23 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -17,10 +17,9 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_flags) { + struct page *unmovable = NULL; struct zone *zone; - unsigned long flags, pfn; - struct memory_isolate_notify arg; - int notifier_ret; + unsigned long flags; int ret = -EBUSY; zone = page_zone(page); @@ -35,41 +34,12 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_ if (is_migrate_isolate_page(page)) goto out; - pfn = page_to_pfn(page); - arg.start_pfn = pfn; - arg.nr_pages = pageblock_nr_pages; - arg.pages_found = 0; - - /* - * It may be possible to isolate a pageblock even if the - * migratetype is not MIGRATE_MOVABLE. The memory isolation - * notifier chain is used by balloon drivers to return the - * number of pages in a range that are held by the balloon - * driver to shrink memory. If all the pages are accounted for - * by balloons, are free, or on the LRU, isolation can continue. - * Later, for example, when memory hotplug notifier runs, these - * pages reported as "can be isolated" should be isolated(freed) - * by the balloon driver through the memory notifier chain. 
- */ - notifier_ret = memory_isolate_notify(MEM_ISOLATE_COUNT, &arg); - notifier_ret = notifier_to_errno(notifier_ret); - if (notifier_ret) - goto out; /* * FIXME: Now, memory hotplug doesn't call shrink_slab() by itself. * We just check MOVABLE pages. */ - if (!has_unmovable_pages(zone, page, arg.pages_found, migratetype, - isol_flags)) - ret = 0; - - /* - * immobile means "not-on-lru" pages. If immobile is larger than - * removable-by-driver pages reported by notifier, we'll fail. - */ - -out: - if (!ret) { + unmovable = has_unmovable_pages(zone, page, migratetype, isol_flags); + if (!unmovable) { unsigned long nr_pages; int mt = get_pageblock_migratetype(page); @@ -79,11 +49,24 @@ out: NULL); __mod_zone_freepage_state(zone, -nr_pages, mt); + ret = 0; } +out: spin_unlock_irqrestore(&zone->lock, flags); - if (!ret) + if (!ret) { drain_all_pages(zone); + } else { + WARN_ON_ONCE(zone_idx(zone) == ZONE_MOVABLE); + + if ((isol_flags & REPORT_FAILURE) && unmovable) + /* + * printk() with zone->lock held will likely trigger a + * lockdep splat, so defer it here. 
+ */ + dump_page(unmovable, "unmovable page"); + } + return ret; } diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index eff4b4520c8d..719c35246cfa 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -52,12 +52,16 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw) return true; } -static inline bool pfn_in_hpage(struct page *hpage, unsigned long pfn) +static inline bool pfn_is_match(struct page *page, unsigned long pfn) { - unsigned long hpage_pfn = page_to_pfn(hpage); + unsigned long page_pfn = page_to_pfn(page); + + /* normal page and hugetlbfs page */ + if (!PageTransCompound(page) || PageHuge(page)) + return page_pfn == pfn; /* THP can be referenced by any subpage */ - return pfn >= hpage_pfn && pfn - hpage_pfn < hpage_nr_pages(hpage); + return pfn >= page_pfn && pfn - page_pfn < hpage_nr_pages(page); } /** @@ -108,7 +112,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw) pfn = pte_pfn(*pvmw->pte); } - return pfn_in_hpage(pvmw->page, pfn); + return pfn_is_match(pvmw->page, pfn); } /** diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c index 357aa7bef6c0..de41e830cdac 100644 --- a/mm/process_vm_access.c +++ b/mm/process_vm_access.c @@ -42,12 +42,11 @@ static int process_vm_rw_pages(struct page **pages, if (copy > len) copy = len; - if (vm_write) { + if (vm_write) copied = copy_page_from_iter(page, offset, copy, iter); - set_page_dirty_lock(page); - } else { + else copied = copy_page_to_iter(page, offset, copy, iter); - } + len -= copied; if (copied < copy && iov_iter_count(iter)) return -EFAULT; @@ -96,7 +95,7 @@ static int process_vm_rw_single_vec(unsigned long addr, flags |= FOLL_WRITE; while (!rc && nr_pages && iov_iter_count(iter)) { - int pages = min(nr_pages, max_pages_per_loop); + int pinned_pages = min(nr_pages, max_pages_per_loop); int locked = 1; size_t bytes; @@ -106,14 +105,15 @@ static int process_vm_rw_single_vec(unsigned long addr, * current/current->mm */ down_read(&mm->mmap_sem); - pages = 
get_user_pages_remote(task, mm, pa, pages, flags, - process_pages, NULL, &locked); + pinned_pages = pin_user_pages_remote(task, mm, pa, pinned_pages, + flags, process_pages, + NULL, &locked); if (locked) up_read(&mm->mmap_sem); - if (pages <= 0) + if (pinned_pages <= 0) return -EFAULT; - bytes = pages * PAGE_SIZE - start_offset; + bytes = pinned_pages * PAGE_SIZE - start_offset; if (bytes > len) bytes = len; @@ -122,10 +122,12 @@ static int process_vm_rw_single_vec(unsigned long addr, vm_write); len -= bytes; start_offset = 0; - nr_pages -= pages; - pa += pages * PAGE_SIZE; - while (pages) - put_page(process_pages[--pages]); + nr_pages -= pinned_pages; + pa += pinned_pages * PAGE_SIZE; + + /* If vm_write is set, the pages need to be made dirty: */ + unpin_user_pages_dirty_lock(process_pages, pinned_pages, + vm_write); } return rc; diff --git a/mm/slub.c b/mm/slub.c index 0ab92ec8c2a6..17dc00e33115 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -439,19 +439,38 @@ static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct page *page, } #ifdef CONFIG_SLUB_DEBUG +static unsigned long object_map[BITS_TO_LONGS(MAX_OBJS_PER_PAGE)]; +static DEFINE_SPINLOCK(object_map_lock); + /* * Determine a map of object in use on a page. * * Node listlock must be held to guarantee that the page does * not vanish from under us. 
*/ -static void get_map(struct kmem_cache *s, struct page *page, unsigned long *map) +static unsigned long *get_map(struct kmem_cache *s, struct page *page) { void *p; void *addr = page_address(page); + VM_BUG_ON(!irqs_disabled()); + + spin_lock(&object_map_lock); + + bitmap_zero(object_map, page->objects); + for (p = page->freelist; p; p = get_freepointer(s, p)) - set_bit(slab_index(p, s, addr), map); + set_bit(slab_index(p, s, addr), object_map); + + return object_map; +} + +static void put_map(unsigned long *map) +{ + VM_BUG_ON(map != object_map); + lockdep_assert_held(&object_map_lock); + + spin_unlock(&object_map_lock); } static inline unsigned int size_from_object(struct kmem_cache *s) @@ -3675,13 +3694,12 @@ static void list_slab_objects(struct kmem_cache *s, struct page *page, #ifdef CONFIG_SLUB_DEBUG void *addr = page_address(page); void *p; - unsigned long *map = bitmap_zalloc(page->objects, GFP_ATOMIC); - if (!map) - return; + unsigned long *map; + slab_err(s, page, text, s->name); slab_lock(page); - get_map(s, page, map); + map = get_map(s, page); for_each_object(p, s, addr, page->objects) { if (!test_bit(slab_index(p, s, addr), map)) { @@ -3689,8 +3707,9 @@ static void list_slab_objects(struct kmem_cache *s, struct page *page, print_tracking(s, p); } } + put_map(map); + slab_unlock(page); - bitmap_free(map); #endif } @@ -4384,19 +4403,19 @@ static int count_total(struct page *page) #endif #ifdef CONFIG_SLUB_DEBUG -static void validate_slab(struct kmem_cache *s, struct page *page, - unsigned long *map) +static void validate_slab(struct kmem_cache *s, struct page *page) { void *p; void *addr = page_address(page); + unsigned long *map; + + slab_lock(page); if (!check_slab(s, page) || !on_freelist(s, page, NULL)) - return; + goto unlock; /* Now we know that a valid freelist exists */ - bitmap_zero(map, page->objects); - - get_map(s, page, map); + map = get_map(s, page); for_each_object(p, s, addr, page->objects) { u8 val = test_bit(slab_index(p, s, addr), 
map) ? SLUB_RED_INACTIVE : SLUB_RED_ACTIVE; @@ -4404,18 +4423,13 @@ static void validate_slab(struct kmem_cache *s, struct page *page, if (!check_object(s, page, p, val)) break; } -} - -static void validate_slab_slab(struct kmem_cache *s, struct page *page, - unsigned long *map) -{ - slab_lock(page); - validate_slab(s, page, map); + put_map(map); +unlock: slab_unlock(page); } static int validate_slab_node(struct kmem_cache *s, - struct kmem_cache_node *n, unsigned long *map) + struct kmem_cache_node *n) { unsigned long count = 0; struct page *page; @@ -4424,7 +4438,7 @@ static int validate_slab_node(struct kmem_cache *s, spin_lock_irqsave(&n->list_lock, flags); list_for_each_entry(page, &n->partial, slab_list) { - validate_slab_slab(s, page, map); + validate_slab(s, page); count++; } if (count != n->nr_partial) @@ -4435,7 +4449,7 @@ static int validate_slab_node(struct kmem_cache *s, goto out; list_for_each_entry(page, &n->full, slab_list) { - validate_slab_slab(s, page, map); + validate_slab(s, page); count++; } if (count != atomic_long_read(&n->nr_slabs)) @@ -4452,15 +4466,11 @@ static long validate_slab_cache(struct kmem_cache *s) int node; unsigned long count = 0; struct kmem_cache_node *n; - unsigned long *map = bitmap_alloc(oo_objects(s->max), GFP_KERNEL); - - if (!map) - return -ENOMEM; flush_all(s); for_each_kmem_cache_node(s, node, n) - count += validate_slab_node(s, n, map); - bitmap_free(map); + count += validate_slab_node(s, n); + return count; } /* @@ -4590,18 +4600,17 @@ static int add_location(struct loc_track *t, struct kmem_cache *s, } static void process_slab(struct loc_track *t, struct kmem_cache *s, - struct page *page, enum track_item alloc, - unsigned long *map) + struct page *page, enum track_item alloc) { void *addr = page_address(page); void *p; + unsigned long *map; - bitmap_zero(map, page->objects); - get_map(s, page, map); - + map = get_map(s, page); for_each_object(p, s, addr, page->objects) if (!test_bit(slab_index(p, s, addr), map)) 
add_location(t, s, get_track(s, p, alloc)); + put_map(map); } static int list_locations(struct kmem_cache *s, char *buf, @@ -4612,11 +4621,9 @@ static int list_locations(struct kmem_cache *s, char *buf, struct loc_track t = { 0, 0, NULL }; int node; struct kmem_cache_node *n; - unsigned long *map = bitmap_alloc(oo_objects(s->max), GFP_KERNEL); - if (!map || !alloc_loc_track(&t, PAGE_SIZE / sizeof(struct location), - GFP_KERNEL)) { - bitmap_free(map); + if (!alloc_loc_track(&t, PAGE_SIZE / sizeof(struct location), + GFP_KERNEL)) { return sprintf(buf, "Out of memory\n"); } /* Push back cpu slabs */ @@ -4631,9 +4638,9 @@ static int list_locations(struct kmem_cache *s, char *buf, spin_lock_irqsave(&n->list_lock, flags); list_for_each_entry(page, &n->partial, slab_list) - process_slab(&t, s, page, alloc, map); + process_slab(&t, s, page, alloc); list_for_each_entry(page, &n->full, slab_list) - process_slab(&t, s, page, alloc, map); + process_slab(&t, s, page, alloc); spin_unlock_irqrestore(&n->list_lock, flags); } @@ -4682,7 +4689,6 @@ static int list_locations(struct kmem_cache *s, char *buf, } free_loc_track(&t); - bitmap_free(map); if (!t.count) len += sprintf(buf, "No data\n"); return len; diff --git a/mm/sparse.c b/mm/sparse.c index 3822ecbd8a1f..3918fc3eaef1 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -789,7 +789,7 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages, ms->usage = NULL; } memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr); - ms->section_mem_map = sparse_encode_mem_map(NULL, section_nr); + ms->section_mem_map = (unsigned long)NULL; } if (section_is_early && memmap) diff --git a/mm/swap.c b/mm/swap.c index 5341ae93861f..cf39d24ada2a 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -813,8 +813,10 @@ void release_pages(struct page **pages, int nr) * processing, and instead, expect a call to * put_page_testzero(). 
*/ - if (put_devmap_managed_page(page)) + if (page_is_devmap_managed(page)) { + put_devmap_managed_page(page); continue; + } } page = compound_head(page); @@ -1102,3 +1104,26 @@ void __init swap_setup(void) * _really_ don't want to cluster much more */ } + +#ifdef CONFIG_DEV_PAGEMAP_OPS +void put_devmap_managed_page(struct page *page) +{ + int count; + + if (WARN_ON_ONCE(!page_is_devmap_managed(page))) + return; + + count = page_ref_dec_return(page); + + /* + * devmap page refcounts are 1-based, rather than 0-based: if + * refcount is 1, then the page is free and the refcount is + * stable because nobody holds a reference on the page. + */ + if (count == 1) + free_devmap_managed_page(page); + else if (!count) + __put_page(page); +} +EXPORT_SYMBOL(put_devmap_managed_page); +#endif diff --git a/mm/swapfile.c b/mm/swapfile.c index bb3261d45b6a..6febae9ad3cd 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -2737,10 +2737,10 @@ static void *swap_next(struct seq_file *swap, void *v, loff_t *pos) else type = si->type + 1; + ++(*pos); for (; (si = swap_type_to_swap_info(type)); type++) { if (!(si->flags & SWP_USED) || !si->swap_map) continue; - ++*pos; return si; } diff --git a/mm/vmscan.c b/mm/vmscan.c index 572fb17c6273..c05eb9efec07 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -146,20 +146,6 @@ struct scan_control { struct reclaim_state reclaim_state; }; -#ifdef ARCH_HAS_PREFETCH -#define prefetch_prev_lru_page(_page, _base, _field) \ - do { \ - if ((_page)->lru.prev != _base) { \ - struct page *prev; \ - \ - prev = lru_to_page(&(_page->lru)); \ - prefetch(&prev->_field); \ - } \ - } while (0) -#else -#define prefetch_prev_lru_page(_page, _base, _field) do { } while (0) -#endif - #ifdef ARCH_HAS_PREFETCHW #define prefetchw_prev_lru_page(_page, _base, _field) \ do { \ @@ -2695,7 +2681,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc) } while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL))); } -static bool shrink_node(pg_data_t *pgdat, 
struct scan_control *sc) +static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) { struct reclaim_state *reclaim_state = current->reclaim_state; unsigned long nr_reclaimed, nr_scanned; @@ -2874,8 +2860,6 @@ again: */ if (reclaimable) pgdat->kswapd_failures = 0; - - return reclaimable; } /* @@ -4126,10 +4110,8 @@ module_init(kswapd_init) */ int node_reclaim_mode __read_mostly; -#define RECLAIM_OFF 0 -#define RECLAIM_ZONE (1<<0) /* Run shrink_inactive_list on the zone */ -#define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */ -#define RECLAIM_UNMAP (1<<2) /* Unmap pages during reclaim */ +#define RECLAIM_WRITE (1<<0) /* Writeout pages during reclaim */ +#define RECLAIM_UNMAP (1<<1) /* Unmap pages during reclaim */ /* * Priority for NODE_RECLAIM. This determines the fraction of pages diff --git a/mm/zswap.c b/mm/zswap.c index 46a322316e52..55094e63b72d 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -32,6 +32,7 @@ #include <linux/swapops.h> #include <linux/writeback.h> #include <linux/pagemap.h> +#include <linux/workqueue.h> /********************************* * statistics @@ -65,6 +66,11 @@ static u64 zswap_reject_kmemcache_fail; /* Duplicate store was encountered (rare) */ static u64 zswap_duplicate_entry; +/* Shrinker work queue */ +static struct workqueue_struct *shrink_wq; +/* Pool limit was hit, we need to calm down */ +static bool zswap_pool_reached_full; + /********************************* * tunables **********************************/ @@ -109,6 +115,11 @@ module_param_cb(zpool, &zswap_zpool_param_ops, &zswap_zpool_type, 0644); static unsigned int zswap_max_pool_percent = 20; module_param_named(max_pool_percent, zswap_max_pool_percent, uint, 0644); +/* The threshold for accepting new pages after the max_pool_percent was hit */ +static unsigned int zswap_accept_thr_percent = 90; /* of max pool size */ +module_param_named(accept_threshold_percent, zswap_accept_thr_percent, + uint, 0644); + /* Enable/disable handling same-value filled pages 
(enabled by default) */ static bool zswap_same_filled_pages_enabled = true; module_param_named(same_filled_pages_enabled, zswap_same_filled_pages_enabled, @@ -123,7 +134,8 @@ struct zswap_pool { struct crypto_comp * __percpu *tfm; struct kref kref; struct list_head list; - struct work_struct work; + struct work_struct release_work; + struct work_struct shrink_work; struct hlist_node node; char tfm_name[CRYPTO_MAX_ALG_NAME]; }; @@ -214,6 +226,13 @@ static bool zswap_is_full(void) DIV_ROUND_UP(zswap_pool_total_size, PAGE_SIZE); } +static bool zswap_can_accept(void) +{ + return totalram_pages() * zswap_accept_thr_percent / 100 * + zswap_max_pool_percent / 100 > + DIV_ROUND_UP(zswap_pool_total_size, PAGE_SIZE); +} + static void zswap_update_total_size(void) { struct zswap_pool *pool; @@ -501,6 +520,16 @@ static struct zswap_pool *zswap_pool_find_get(char *type, char *compressor) return NULL; } +static void shrink_worker(struct work_struct *w) +{ + struct zswap_pool *pool = container_of(w, typeof(*pool), + shrink_work); + + if (zpool_shrink(pool->zpool, 1, NULL)) + zswap_reject_reclaim_fail++; + zswap_pool_put(pool); +} + static struct zswap_pool *zswap_pool_create(char *type, char *compressor) { struct zswap_pool *pool; @@ -551,6 +580,7 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor) */ kref_init(&pool->kref); INIT_LIST_HEAD(&pool->list); + INIT_WORK(&pool->shrink_work, shrink_worker); zswap_pool_debug("created", pool); @@ -624,7 +654,8 @@ static int __must_check zswap_pool_get(struct zswap_pool *pool) static void __zswap_pool_release(struct work_struct *work) { - struct zswap_pool *pool = container_of(work, typeof(*pool), work); + struct zswap_pool *pool = container_of(work, typeof(*pool), + release_work); synchronize_rcu(); @@ -647,8 +678,8 @@ static void __zswap_pool_empty(struct kref *kref) list_del_rcu(&pool->list); - INIT_WORK(&pool->work, __zswap_pool_release); - schedule_work(&pool->work); + INIT_WORK(&pool->release_work, 
__zswap_pool_release); + schedule_work(&pool->release_work); spin_unlock(&zswap_pools_lock); } @@ -942,22 +973,6 @@ end: return ret; } -static int zswap_shrink(void) -{ - struct zswap_pool *pool; - int ret; - - pool = zswap_pool_last_get(); - if (!pool) - return -ENOENT; - - ret = zpool_shrink(pool->zpool, 1, NULL); - - zswap_pool_put(pool); - - return ret; -} - static int zswap_is_page_same_filled(void *ptr, unsigned long *value) { unsigned int pos; @@ -1011,21 +1026,23 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset, /* reclaim space if needed */ if (zswap_is_full()) { + struct zswap_pool *pool; + zswap_pool_limit_hit++; - if (zswap_shrink()) { - zswap_reject_reclaim_fail++; - ret = -ENOMEM; - goto reject; - } + zswap_pool_reached_full = true; + pool = zswap_pool_last_get(); + if (pool) + queue_work(shrink_wq, &pool->shrink_work); + ret = -ENOMEM; + goto reject; + } - /* A second zswap_is_full() check after - * zswap_shrink() to make sure it's now - * under the max_pool_percent - */ - if (zswap_is_full()) { + if (zswap_pool_reached_full) { + if (!zswap_can_accept()) { ret = -ENOMEM; goto reject; - } + } else + zswap_pool_reached_full = false; } /* allocate entry */ @@ -1332,11 +1349,18 @@ static int __init init_zswap(void) zswap_enabled = false; } + shrink_wq = create_workqueue("zswap-shrink"); + if (!shrink_wq) + goto fallback_fail; + frontswap_register_ops(&zswap_frontswap_ops); if (zswap_debugfs_init()) pr_warn("debugfs initialization failed\n"); return 0; +fallback_fail: + if (pool) + zswap_pool_destroy(pool); hp_fail: cpuhp_remove_state(CPUHP_MM_ZSWP_MEM_PREPARE); dstmem_fail: diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c index f93e917e0929..fa7bb5e060d0 100644 --- a/net/xdp/xdp_umem.c +++ b/net/xdp/xdp_umem.c @@ -212,7 +212,7 @@ static int xdp_umem_map_pages(struct xdp_umem *umem) static void xdp_umem_unpin_pages(struct xdp_umem *umem) { - put_user_pages_dirty_lock(umem->pgs, umem->npgs, true); + 
unpin_user_pages_dirty_lock(umem->pgs, umem->npgs, true); kfree(umem->pgs); umem->pgs = NULL; @@ -291,7 +291,7 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem) return -ENOMEM; down_read(&current->mm->mmap_sem); - npgs = get_user_pages(umem->address, umem->npgs, + npgs = pin_user_pages(umem->address, umem->npgs, gup_flags | FOLL_LONGTERM, &umem->pgs[0], NULL); up_read(&current->mm->mmap_sem); diff --git a/scripts/spelling.txt b/scripts/spelling.txt index 672b5931bc8d..ffa838f3a2b5 100644 --- a/scripts/spelling.txt +++ b/scripts/spelling.txt @@ -39,6 +39,8 @@ accout||account accquire||acquire accquired||acquired accross||across +accumalate||accumulate +accumalator||accumulator acessable||accessible acess||access acessing||accessing @@ -106,6 +108,7 @@ alogrithm||algorithm alot||a lot alow||allow alows||allows +alreay||already alredy||already altough||although alue||value @@ -241,6 +244,7 @@ calender||calendar calescing||coalescing calle||called callibration||calibration +callled||called calucate||calculate calulate||calculate cancelation||cancellation @@ -311,6 +315,7 @@ compaibility||compatibility comparsion||comparison compatability||compatibility compatable||compatible +compatibililty||compatibility compatibiliy||compatibility compatibilty||compatibility compatiblity||compatibility @@ -330,6 +335,7 @@ comunication||communication conbination||combination conditionaly||conditionally conditon||condition +condtion||condition conected||connected conector||connector connecetd||connected @@ -388,6 +394,8 @@ dafault||default deafult||default deamon||daemon debouce||debounce +decendant||descendant +decendants||descendants decompres||decompress decsribed||described decription||description @@ -411,11 +419,13 @@ delare||declare delares||declares delaring||declaring delemiter||delimiter +delievered||delivered demodualtor||demodulator demension||dimension dependancies||dependencies dependancy||dependency dependant||dependent +dependend||dependent depreacted||deprecated
depreacte||deprecate desactivate||deactivate @@ -791,6 +801,7 @@ ireelevant||irrelevant irrelevent||irrelevant isnt||isn't isssue||issue +issus||issues iternations||iterations itertation||iteration itslef||itself @@ -995,6 +1006,7 @@ peice||piece pendantic||pedantic peprocessor||preprocessor perfoming||performing +perfomring||performing peripherial||peripheral permissons||permissions peroid||period @@ -1166,6 +1178,8 @@ retreive||retrieve retreiving||retrieving retrive||retrieve retrived||retrieved +retrun||return +retun||return retuned||returned reudce||reduce reuest||request diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c index 485cf06ef013..389327e9b30a 100644 --- a/tools/testing/selftests/vm/gup_benchmark.c +++ b/tools/testing/selftests/vm/gup_benchmark.c @@ -18,6 +18,9 @@ #define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark) #define GUP_BENCHMARK _IOWR('g', 3, struct gup_benchmark) +/* Just the flags we need, copied from mm.h: */ +#define FOLL_WRITE 0x01 /* check pte is writable */ + struct gup_benchmark { __u64 get_delta_usec; __u64 put_delta_usec; @@ -85,7 +88,8 @@ int main(int argc, char **argv) } gup.nr_pages_per_call = nr_pages; - gup.flags = write; + if (write) + gup.flags |= FOLL_WRITE; fd = open("/sys/kernel/debug/gup_benchmark", O_RDWR); if (fd == -1) diff --git a/tools/vm/slabinfo.c b/tools/vm/slabinfo.c index 68092d15e12b..9b68658b6bb8 100644 --- a/tools/vm/slabinfo.c +++ b/tools/vm/slabinfo.c @@ -720,11 +720,11 @@ static void slab_debug(struct slabinfo *s) return; if (sanity && !s->sanity_checks) { - set_obj(s, "sanity", 1); + set_obj(s, "sanity_checks", 1); } if (!sanity && s->sanity_checks) { if (slab_empty(s)) - set_obj(s, "sanity", 0); + set_obj(s, "sanity_checks", 0); else fprintf(stderr, "%s not empty cannot disable sanity checks\n", s->name); } |
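The mm/zswap.c hunks above replace the synchronous shrink-and-recheck in zswap_frontswap_store() with a hysteresis: once the pool reaches max_pool_percent, stores are rejected until the pool drops below accept_threshold_percent of that limit. Below is a minimal userspace sketch of that decision logic, not kernel code. The struct and the model_* function names are hypothetical; the fields stand in for totalram_pages(), the pool size in pages, and the two module parameters; and the body of model_is_full() is an assumption, since the diff shows only the tail of zswap_is_full().

```c
#include <stdbool.h>

/* Hypothetical userspace model of the zswap accept/reject state. */
struct zswap_model {
	unsigned long total_ram_pages;   /* stand-in for totalram_pages() */
	unsigned long pool_pages;        /* stand-in for the pool size in pages */
	unsigned int max_pool_percent;   /* defaults to 20 in the diff */
	unsigned int accept_thr_percent; /* defaults to 90 in the diff */
	bool reached_full;               /* mirrors zswap_pool_reached_full */
};

/* Assumed shape of zswap_is_full(): pool exceeds the hard limit. */
static bool model_is_full(const struct zswap_model *z)
{
	return z->total_ram_pages * z->max_pool_percent / 100 < z->pool_pages;
}

/* Same arithmetic as zswap_can_accept() in the diff. */
static bool model_can_accept(const struct zswap_model *z)
{
	return z->total_ram_pages * z->accept_thr_percent / 100 *
	       z->max_pool_percent / 100 > z->pool_pages;
}

/*
 * Returns true if a store would proceed, false if it would be rejected.
 * The kernel additionally queues shrink work when the pool is full;
 * that side effect is omitted here.
 */
static bool model_store_allowed(struct zswap_model *z)
{
	if (model_is_full(z)) {
		z->reached_full = true;
		return false;
	}

	if (z->reached_full) {
		if (!model_can_accept(z))
			return false;
		z->reached_full = false;
	}

	return true;
}
```

With 1000 RAM pages and the default percentages, the hard limit is 200 pages and the accept threshold is 180 pages, so after the limit is hit a pool of 190 pages is still rejected while 179 pages is accepted again.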