summaryrefslogtreecommitdiff
path: root/kernel/dma
AgeCommit message (Collapse)AuthorFilesLines
2023-11-08swiotlb: fix out-of-bounds TLB allocations with CONFIG_SWIOTLB_DYNAMICPetr Tesarik1-1/+2
Limit the free list length to the size of the IO TLB. Transient pool can be smaller than IO_TLB_SEGSIZE, but the free list is initialized with the assumption that the total number of slots is a multiple of IO_TLB_SEGSIZE. As a result, swiotlb_area_find_slots() may allocate slots past the end of a transient IO TLB buffer. Reported-by: Niklas Schnelle <schnelle@linux.ibm.com> Closes: https://lore.kernel.org/linux-iommu/104a8c8fedffd1ff8a2890983e2ec1c26bff6810.camel@linux.ibm.com/ Fixes: 79636caad361 ("swiotlb: if swiotlb is full, fall back to a transient memory pool") Cc: stable@vger.kernel.org Signed-off-by: Petr Tesarik <petr.tesarik1@huawei-partners.com> Reviewed-by: Halil Pasic <pasic@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-11-06dma-mapping: fix dma_addressing_limited() if dma_range_map can't cover all ↵Jia He3-2/+50
system RAM There is an unusual case that the range map covers right up to the top of system RAM, but leaves a hole somewhere lower down. Then it prevents the nvme device dma mapping in the checking path of phys_to_dma() and causes the hangs at boot. E.g. On an Armv8 Ampere server, the dsdt ACPI table is: Method (_DMA, 0, Serialized) // _DMA: Direct Memory Access { Name (RBUF, ResourceTemplate () { QWordMemory (ResourceConsumer, PosDecode, MinFixed, MaxFixed, Cacheable, ReadWrite, 0x0000000000000000, // Granularity 0x0000000000000000, // Range Minimum 0x00000000FFFFFFFF, // Range Maximum 0x0000000000000000, // Translation Offset 0x0000000100000000, // Length ,, , AddressRangeMemory, TypeStatic) QWordMemory (ResourceConsumer, PosDecode, MinFixed, MaxFixed, Cacheable, ReadWrite, 0x0000000000000000, // Granularity 0x0000006010200000, // Range Minimum 0x000000602FFFFFFF, // Range Maximum 0x0000000000000000, // Translation Offset 0x000000001FE00000, // Length ,, , AddressRangeMemory, TypeStatic) QWordMemory (ResourceConsumer, PosDecode, MinFixed, MaxFixed, Cacheable, ReadWrite, 0x0000000000000000, // Granularity 0x00000060F0000000, // Range Minimum 0x00000060FFFFFFFF, // Range Maximum 0x0000000000000000, // Translation Offset 0x0000000010000000, // Length ,, , AddressRangeMemory, TypeStatic) QWordMemory (ResourceConsumer, PosDecode, MinFixed, MaxFixed, Cacheable, ReadWrite, 0x0000000000000000, // Granularity 0x0000007000000000, // Range Minimum 0x000003FFFFFFFFFF, // Range Maximum 0x0000000000000000, // Translation Offset 0x0000039000000000, // Length ,, , AddressRangeMemory, TypeStatic) }) But the System RAM ranges are: cat /proc/iomem |grep -i ram 90000000-91ffffff : System RAM 92900000-fffbffff : System RAM 880000000-fffffffff : System RAM 8800000000-bff5990fff : System RAM bff59d0000-bff5a4ffff : System RAM bff8000000-bfffffffff : System RAM So some RAM ranges are out of dma_range_map. Fix it by checking whether each of the system RAM resources can be properly encompassed within the dma_range_map. Signed-off-by: Jia He <justin.he@arm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-11-06dma-mapping: move dma_addressing_limited() out of lineJia He1-0/+15
This patch moves dma_addressing_limited() out of line, serving as a preliminary step to prevent the introduction of a new publicly accessible low-level helper when validating whether all system RAM is mapped within the DMA mapping range. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jia He <justin.he@arm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-11-03swiotlb: do not free decrypted pages if dynamicPetr Tesarik1-9/+16
Fix these two error paths: 1. When set_memory_decrypted() fails, pages may be left fully or partially decrypted. 2. Decrypted pages may be freed if swiotlb_alloc_tlb() determines that the physical address is too high. To fix the first issue, call set_memory_encrypted() on the allocated region after a failed decryption attempt. If that also fails, leak the pages. To fix the second issue, check that the TLB physical address is below the requested limit before decrypting. Let the caller differentiate between unsuitable physical address (=> retry from a lower zone) and allocation failures (=> no point in retrying). Cc: stable@vger.kernel.org Fixes: 79636caad361 ("swiotlb: if swiotlb is full, fall back to a transient memory pool") Signed-off-by: Petr Tesarik <petr.tesarik1@huawei-partners.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-11-01Merge tag 'dma-mapping-6.7-2023-10-30' of ↵Linus Torvalds4-30/+32
git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping updates from Christoph Hellwig: - get rid of the fake support for coherent DMA allocation on coldfire with caches (Christoph Hellwig) - add a few Kconfig dependencies so that Kconfig catches the use of invalid configurations (Christoph Hellwig) - fix a type in dma-debug output (Chuck Lever) - rewrite a comment in swiotlb (Sean Christopherson) * tag 'dma-mapping-6.7-2023-10-30' of git://git.infradead.org/users/hch/dma-mapping: dma-debug: Fix a typo in a debugging eye-catcher swiotlb: rewrite comment explaining why the source is preserved on DMA_FROM_DEVICE m68k: remove unused includes from dma.c m68k: don't provide arch_dma_alloc for nommu/coldfire net: fec: use dma_alloc_noncoherent for data cache enabled coldfire m68k: use the coherent DMA code for coldfire without data cache dma-direct: warn when coherent allocations aren't supported dma-direct: simplify the use atomic pool logic in dma_direct_alloc dma-direct: add a CONFIG_ARCH_HAS_DMA_ALLOC symbol dma-direct: add dependencies to CONFIG_DMA_GLOBAL_POOL
2023-10-25swiotlb: do not try to allocate a TLB bigger than MAX_ORDER pagesPetr Tesarik1-0/+5
When allocating a new pool at runtime, reduce the number of slabs so that the allocation order is at most MAX_ORDER. This avoids a kernel warning in __alloc_pages(). The warning is relatively benign, because the pool size is subsequently reduced when allocation fails, but it is silly to start with a request that is known to fail, especially since this is the default behavior if the kernel is built with CONFIG_SWIOTLB_DYNAMIC=y and booted without any swiotlb= parameter. Reported-by: Ben Greear <greearb@candelatech.com> Closes: https://lore.kernel.org/netdev/4f173dd2-324a-0240-ff8d-abf5c191be18@candelatech.com/ Fixes: 1aaa736815eb ("swiotlb: allocate a new memory pool when existing pools are full") Signed-off-by: Petr Tesarik <petr.tesarik1@huawei-partners.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-10-23dma-debug: Fix a typo in a debugging eye-catcherChuck Lever1-1/+1
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-10-23swiotlb: rewrite comment explaining why the source is preserved on ↵Sean Christopherson1-5/+7
DMA_FROM_DEVICE Rewrite the comment explaining why swiotlb copies the original buffer to the TLB buffer before initiating DMA *from* the device, i.e. before the device DMAs into the TLB buffer. The existing comment's argument that preserving the original data can prevent a kernel memory leak is bogus. If the driver that triggered the mapping _knows_ that the device will overwrite the entire mapping, or the driver will consume only the written parts, then copying from the original memory is completely pointless. If neither of the above holds true, then copying from the original adds value only if preserving the data is necessary for functional correctness, or the driver explicitly initialized the original memory. If the driver didn't initialize the memory, then copying the original buffer to the TLB buffer simply changes what kernel data is leaked to user space. Writing the entire TLB buffer _does_ prevent leaking stale TLB buffer data from a previous bounce, but that can be achieved by simply zeroing the TLB buffer when grabbing a slot. The real reason swiotlb ended up initializing the TLB buffer with the original buffer is that it's necessary to make swiotlb operate as transparently as possible, i.e. to behave as closely as possible to hardware, and to avoid corrupting the original buffer, e.g. if the driver knows the device will do partial writes and is relying on the unwritten data to be preserved. Reviewed-by: Robin Murphy <robin.murphy@arm.com> Link: https://lore.kernel.org/all/ZN5elYQ5szQndN8n@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-10-22dma-direct: warn when coherent allocations aren't supportedChristoph Hellwig1-1/+3
Log a warning once when dma_alloc_coherent fails because the platform does not support coherent allocations at all. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Greg Ungerer <gerg@linux-m68k.org> Tested-by: Greg Ungerer <gerg@linux-m68k.org>
2023-10-22dma-direct: simplify the use atomic pool logic in dma_direct_allocChristoph Hellwig1-15/+10
The logic in dma_direct_alloc when to use the atomic pool vs remapping grew a bit unreadable. Consolidate it into a single check, and clean up the set_uncached vs remap logic a bit as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Greg Ungerer <gerg@linux-m68k.org> Tested-by: Greg Ungerer <gerg@linux-m68k.org>
2023-10-22dma-direct: add a CONFIG_ARCH_HAS_DMA_ALLOC symbolChristoph Hellwig2-10/+11
Instead of using arch_dma_alloc if none of the generic coherent allocators are used, require the architectures to explicitly opt into providing it. This will used to deal with the case of m68knommu and coldfire where we can't do any coherent allocations whatsoever, and also makes it clear that arch_dma_alloc is a last resort. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Greg Ungerer <gerg@linux-m68k.org> Tested-by: Greg Ungerer <gerg@linux-m68k.org>
2023-10-22dma-direct: add dependencies to CONFIG_DMA_GLOBAL_POOLChristoph Hellwig1-0/+2
CONFIG_DMA_GLOBAL_POOL can't be combined with other DMA coherent allocators. Add dependencies to Kconfig to document this, and make kconfig complain about unmet dependencies if someone tries. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> Reviewed-by: Greg Ungerer <gerg@linux-m68k.org> Tested-by: Greg Ungerer <gerg@linux-m68k.org>
2023-09-27swiotlb: fix the check whether a device has used software IO TLBPetr Tesarik1-6/+20
When CONFIG_SWIOTLB_DYNAMIC=y, devices which do not use the software IO TLB can avoid swiotlb lookup. A flag is added by commit 1395706a1490 ("swiotlb: search the software IO TLB only if the device makes use of it"), the flag is correctly set, but it is then never checked. Add the actual check here. Note that this code is an alternative to the default pool check, not an additional check, because: 1. swiotlb_find_pool() also searches the default pool; 2. if dma_uses_io_tlb is false, the default swiotlb pool is not used. Tested in a KVM guest against a QEMU RAM-backed SATA disk over virtio and *not* using software IO TLB, this patch increases IOPS by approx 2% for 4-way parallel I/O. The write memory barrier in swiotlb_dyn_alloc() is not needed, because a newly allocated pool must always be observed by swiotlb_find_slots() before an address from that pool is passed to is_swiotlb_buffer(). Correctness was verified using the following litmus test: C swiotlb-new-pool (* * Result: Never * * Check that a newly allocated pool is always visible when the * corresponding swiotlb buffer is visible. *) { mem_pools = default; } P0(int **mem_pools, int *pool) { /* add_mem_pool() */ WRITE_ONCE(*pool, 999); rcu_assign_pointer(*mem_pools, pool); } P1(int **mem_pools, int *flag, int *buf) { /* swiotlb_find_slots() */ int *r0; int r1; rcu_read_lock(); r0 = READ_ONCE(*mem_pools); r1 = READ_ONCE(*r0); rcu_read_unlock(); if (r1) { WRITE_ONCE(*flag, 1); smp_mb(); } /* device driver (presumed) */ WRITE_ONCE(*buf, r1); } P2(int **mem_pools, int *flag, int *buf) { /* device driver (presumed) */ int r0 = READ_ONCE(*buf); /* is_swiotlb_buffer() */ int r1; int *r2; int r3; smp_rmb(); r1 = READ_ONCE(*flag); if (r1) { /* swiotlb_find_pool() */ rcu_read_lock(); r2 = READ_ONCE(*mem_pools); r3 = READ_ONCE(*r2); rcu_read_unlock(); } } exists (2:r0<>0 /\ 2:r3=0) (* Not found. *) Fixes: 1395706a1490 ("swiotlb: search the software IO TLB only if the device makes use of it") Reported-by: Jonathan Corbet <corbet@lwn.net> Closes: https://lore.kernel.org/linux-iommu/87a5uz3ob8.fsf@meer.lwn.net/ Signed-off-by: Petr Tesarik <petr@tesarici.cz> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-09-13swiotlb: use the calculated number of areasRoss Lagerwall1-3/+2
Commit 8ac04063354a ("swiotlb: reduce the number of areas to match actual memory pool size") calculated the reduced number of areas in swiotlb_init_remap() but didn't actually use the value. Replace usage of default_nareas accordingly. Fixes: 8ac04063354a ("swiotlb: reduce the number of areas to match actual memory pool size") Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-09-08Revert "dma-contiguous: check for memory region overlap"Zhenhua Huang1-5/+0
This reverts commit 3fa6456ebe13adab3ba1817c8e515a5b88f95dce. The Commit broke the CMA region creation through DT on arm64, as showed below logs with "memblock=debug": [ 0.000000] memblock_phys_alloc_range: 41943040 bytes align=0x200000 from=0x0000000000000000 max_addr=0x00000000ffffffff early_init_dt_alloc_reserved_memory_arch+0x34/0xa0 [ 0.000000] memblock_reserve: [0x00000000fd600000-0x00000000ffdfffff] memblock_alloc_range_nid+0xc0/0x19c [ 0.000000] Reserved memory: overlap with other memblock reserved region >From call flow, region we defined in DT was always reserved before entering into rmem_cma_setup. Also, rmem_cma_setup has one routine cma_init_reserved_mem to ensure the region was reserved. Checking the region not reserved here seems not correct. early_init_fdt_scan_reserved_mem: fdt_scan_reserved_mem __reserved_mem_reserve_reg early_init_dt_reserve_memory memblock_reserve(using “reg” prop case) fdt_init_reserved_mem __reserved_mem_alloc_size *early_init_dt_alloc_reserved_memory_arch* memblock_reserve(dynamic alloc case) __reserved_mem_init_node rmem_cma_setup(region overlap check here should always fail) Example DT can be used to reproduce issue: dump_mem: mem_dump_region { compatible = "shared-dma-pool"; alloc-ranges = <0x0 0x00000000 0x0 0xffffffff>; reusable; size = <0 0x2800000>; }; Signed-off-by: Zhenhua Huang <quic_zhenhuah@quicinc.com>
2023-08-31dma-pool: remove a __maybe_unused label in atomic_pool_expandChristoph Hellwig1-2/+2
Move the #endif a line so that free_page label is only seen by the compile pass when actually used. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chunhui He <hchunhui@mail.ustc.edu.cn> Reviewed-by: Robin Murphy <roin.murphy@arm.com>
2023-08-30dma-contiguous: fix the Kconfig entry for CONFIG_DMA_NUMA_CMAChristoph Hellwig1-1/+1
It makes no sense to expose CONFIG_DMA_NUMA_CMA if CONFIG_NUMA is not enabled, and random config options shouldn't be default unless there is a good reason. Replace the default NUMA with a depends on to fix both issues. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <roin.murphy@arm.com>
2023-08-30dma-debug: don't call __dma_entry_alloc_check_leak() under free_entries_lockSergey Senozhatsky1-5/+15
__dma_entry_alloc_check_leak() calls into printk -> serial console output (qcom geni) and grabs port->lock under free_entries_lock spin lock, which is a reverse locking dependency chain as qcom_geni IRQ handler can call into dma-debug code and grab free_entries_lock under port->lock. Move __dma_entry_alloc_check_leak() call out of free_entries_lock scope so that we don't acquire serial console's port->lock under it. Trimmed-down lockdep splat: The existing dependency chain (in reverse order) is: -> #2 (free_entries_lock){-.-.}-{2:2}: _raw_spin_lock_irqsave+0x60/0x80 dma_entry_alloc+0x38/0x110 debug_dma_map_page+0x60/0xf8 dma_map_page_attrs+0x1e0/0x230 dma_map_single_attrs.constprop.0+0x6c/0xc8 geni_se_rx_dma_prep+0x40/0xcc qcom_geni_serial_isr+0x310/0x510 __handle_irq_event_percpu+0x110/0x244 handle_irq_event_percpu+0x20/0x54 handle_irq_event+0x50/0x88 handle_fasteoi_irq+0xa4/0xcc handle_irq_desc+0x28/0x40 generic_handle_domain_irq+0x24/0x30 gic_handle_irq+0xc4/0x148 do_interrupt_handler+0xa4/0xb0 el1_interrupt+0x34/0x64 el1h_64_irq_handler+0x18/0x24 el1h_64_irq+0x64/0x68 arch_local_irq_enable+0x4/0x8 ____do_softirq+0x18/0x24 ... -> #1 (&port_lock_key){-.-.}-{2:2}: _raw_spin_lock_irqsave+0x60/0x80 qcom_geni_serial_console_write+0x184/0x1dc console_flush_all+0x344/0x454 console_unlock+0x94/0xf0 vprintk_emit+0x238/0x24c vprintk_default+0x3c/0x48 vprintk+0xb4/0xbc _printk+0x68/0x90 register_console+0x230/0x38c uart_add_one_port+0x338/0x494 qcom_geni_serial_probe+0x390/0x424 platform_probe+0x70/0xc0 really_probe+0x148/0x280 __driver_probe_device+0xfc/0x114 driver_probe_device+0x44/0x100 __device_attach_driver+0x64/0xdc bus_for_each_drv+0xb0/0xd8 __device_attach+0xe4/0x140 device_initial_probe+0x1c/0x28 bus_probe_device+0x44/0xb0 device_add+0x538/0x668 of_device_add+0x44/0x50 of_platform_device_create_pdata+0x94/0xc8 of_platform_bus_create+0x270/0x304 of_platform_populate+0xac/0xc4 devm_of_platform_populate+0x60/0xac geni_se_probe+0x154/0x160 platform_probe+0x70/0xc0 ... -> #0 (console_owner){-...}-{0:0}: __lock_acquire+0xdf8/0x109c lock_acquire+0x234/0x284 console_flush_all+0x330/0x454 console_unlock+0x94/0xf0 vprintk_emit+0x238/0x24c vprintk_default+0x3c/0x48 vprintk+0xb4/0xbc _printk+0x68/0x90 dma_entry_alloc+0xb4/0x110 debug_dma_map_sg+0xdc/0x2f8 __dma_map_sg_attrs+0xac/0xe4 dma_map_sgtable+0x30/0x4c get_pages+0x1d4/0x1e4 [msm] msm_gem_pin_pages_locked+0x38/0xac [msm] msm_gem_pin_vma_locked+0x58/0x88 [msm] msm_ioctl_gem_submit+0xde4/0x13ac [msm] drm_ioctl_kernel+0xe0/0x15c drm_ioctl+0x2e8/0x3f4 vfs_ioctl+0x30/0x50 ... Chain exists of: console_owner --> &port_lock_key --> free_entries_lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(free_entries_lock); lock(&port_lock_key); lock(free_entries_lock); lock(console_owner); *** DEADLOCK *** Call trace: dump_backtrace+0xb4/0xf0 show_stack+0x20/0x30 dump_stack_lvl+0x60/0x84 dump_stack+0x18/0x24 print_circular_bug+0x1cc/0x234 check_noncircular+0x78/0xac __lock_acquire+0xdf8/0x109c lock_acquire+0x234/0x284 console_flush_all+0x330/0x454 console_unlock+0x94/0xf0 vprintk_emit+0x238/0x24c vprintk_default+0x3c/0x48 vprintk+0xb4/0xbc _printk+0x68/0x90 dma_entry_alloc+0xb4/0x110 debug_dma_map_sg+0xdc/0x2f8 __dma_map_sg_attrs+0xac/0xe4 dma_map_sgtable+0x30/0x4c get_pages+0x1d4/0x1e4 [msm] msm_gem_pin_pages_locked+0x38/0xac [msm] msm_gem_pin_vma_locked+0x58/0x88 [msm] msm_ioctl_gem_submit+0xde4/0x13ac [msm] drm_ioctl_kernel+0xe0/0x15c drm_ioctl+0x2e8/0x3f4 vfs_ioctl+0x30/0x50 ... Reported-by: Rob Clark <robdclark@chromium.org> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-08swiotlb: optimize get_max_slots()Petr Tesarik1-3/+1
Use a simple logical shift and increment to calculate the number of slots taken by the DMA segment boundary. At least GCC-13 is not able to optimize the expression, producing this horrible assembly code on x86: cmpq $-1, %rcx je .L364 addq $2048, %rcx shrq $11, %rcx movq %rcx, %r13 .L331: // rest of the function here... // after function epilogue and return: .L364: movabsq $9007199254740992, %r13 jmp .L331 After the optimization, the code looks more reasonable: shrq $11, %r11 leaq 1(%r11), %rbx Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-08swiotlb: move slot allocation explanation comment where it belongsPetr Tesarik1-5/+5
Move the comment down in front of the loop that actually sets the list member of struct io_tlb_slot to zero. Fixes: 26a7e094783d ("swiotlb: refactor swiotlb_tbl_map_single") Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-01swiotlb: search the software IO TLB only if the device makes use of itPetr Tesarik1-8/+6
Skip searching the software IO TLB if a device has never used it, making sure these devices are not affected by the introduction of multiple IO TLB memory pools. Additional memory barrier is required to ensure that the new value of the flag is visible to other CPUs after mapping a new bounce buffer. For efficiency, the flag check should be inlined, and then the memory barrier must be moved to is_swiotlb_buffer(). However, it can replace the existing barrier in swiotlb_find_pool(), because all callers use is_swiotlb_buffer() first to verify that the buffer address belongs to the software IO TLB. Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-01swiotlb: allocate a new memory pool when existing pools are fullPetr Tesarik1-25/+123
When swiotlb_find_slots() cannot find suitable slots, schedule the allocation of a new memory pool. It is not possible to allocate the pool immediately, because this code may run in interrupt context, which is not suitable for large memory allocations. This means that the memory pool will be available too late for the currently requested mapping, but the stress on the software IO TLB allocator is likely to continue, and subsequent allocations will benefit from the additional pool eventually. Keep all memory pools for an allocator in an RCU list to avoid locking on the read side. For modifications, add a new spinlock to struct io_tlb_mem. The spinlock also protects updates to the total number of slabs (nslabs in struct io_tlb_mem), but not reads of the value. Readers may therefore encounter a stale value, but this is not an issue: - swiotlb_tbl_map_single() and is_swiotlb_active() only check for non-zero value. This is ensured by the existence of the default memory pool, allocated at boot. - The exact value is used only for non-critical purposes (debugfs, kernel messages). Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-01swiotlb: determine potential physical address limitPetr Tesarik1-0/+14
The value returned by default_swiotlb_limit() should be constant, because it is used to decide whether DMA can be used. To allow allocating memory pools on the fly, use the maximum possible physical address rather than the highest address used by the default pool. For swiotlb_init_remap(), this is either an arch-specific limit used by memblock_alloc_low(), or the highest directly mapped physical address if the initialization flags include SWIOTLB_ANY. For swiotlb_init_late(), the highest address is determined by the GFP flags. Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-01swiotlb: if swiotlb is full, fall back to a transient memory poolPetr Tesarik2-9/+309
Try to allocate a transient memory pool if no suitable slots can be found and the respective SWIOTLB is allowed to grow. The transient pool is just enough big for this one bounce buffer. It is inserted into a per-device list of transient memory pools, and it is freed again when the bounce buffer is unmapped. Transient memory pools are kept in an RCU list. A memory barrier is required after adding a new entry, because any address within a transient buffer must be immediately recognized as belonging to the SWIOTLB, even if it is passed to another CPU. Deletion does not require any synchronization beyond RCU ordering guarantees. After a buffer is unmapped, its physical addresses may no longer be passed to the DMA API, so the memory range of the corresponding stale entry in the RCU list never matches. If the memory range gets allocated again, then it happens only after a RCU quiescent state. Since bounce buffers can now be allocated from different pools, add a parameter to swiotlb_alloc_pool() to let the caller know which memory pool is used. Add swiotlb_find_pool() to find the memory pool corresponding to an address. This function is now also used by is_swiotlb_buffer(), because a simple boundary check is no longer sufficient. The logic in swiotlb_alloc_tlb() is taken from __dma_direct_alloc_pages(), simplified and enhanced to use coherent memory pools if needed. Note that this is not the most efficient way to provide a bounce buffer, but when a DMA buffer can't be mapped, something may (and will) actually break. At that point it is better to make an allocation, even if it may be an expensive operation. Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-01swiotlb: add a flag whether SWIOTLB is allowed to growPetr Tesarik2-0/+26
Add a config option (CONFIG_SWIOTLB_DYNAMIC) to enable or disable dynamic allocation of additional bounce buffers. If this option is set, mark the default SWIOTLB as able to grow and restricted DMA pools as unable. However, if the address of the default memory pool is explicitly queried, make the default SWIOTLB also unable to grow. This is currently used to set up PCI BAR movable regions on some Octeon MIPS boards which may not be able to use a SWIOTLB pool elsewhere in physical memory. See octeon_pci_setup() for more details. If a remap function is specified, it must be also called on any dynamically allocated pools, but there are some issues: - The remap function may block, so it should not be called from an atomic context. - There is no corresponding unremap() function if the memory pool is freed. - The only in-tree implementation (xen_swiotlb_fixup) requires that the number of slots in the memory pool is a multiple of SWIOTLB_SEGSIZE. Keep it simple for now and disable growing the SWIOTLB if a remap function was specified. Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-01swiotlb: separate memory pool data from other allocator dataPetr Tesarik1-64/+111
Carve out memory pool specific fields from struct io_tlb_mem. The original struct now contains shared data for the whole allocator, while the new struct io_tlb_pool contains data that is specific to one memory pool of (potentially) many. Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-01swiotlb: add documentation and rename swiotlb_do_find_slots()Petr Tesarik1-6/+55
Add some kernel-doc comments and move the existing documentation of struct io_tlb_slot to its correct location. The latter was forgotten in commit 942a8186eb445 ("swiotlb: move struct io_tlb_slot to swiotlb.c"). Use the opportunity to give swiotlb_do_find_slots() a more descriptive name and make it clear how it differs from swiotlb_find_slots(). Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-01swiotlb: make io_tlb_default_mem local to swiotlb.cPetr Tesarik1-1/+38
SWIOTLB implementation details should not be exposed to the rest of the kernel. This will allow to make changes to the implementation without modifying non-swiotlb code. To avoid breaking existing users, provide helper functions for the few required fields. As a bonus, using a helper function to initialize struct device allows to get rid of an #ifdef in driver core. Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-08-01swiotlb: bail out of swiotlb_init_late() if swiotlb is already allocatedPetr Tesarik1-0/+3
If swiotlb is allocated, immediately return 0, so callers do not have to check io_tlb_default_mem.nslabs explicitly. Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-07-31dma-contiguous: check for memory region overlapBinglei Wang1-0/+5
In the process of parsing the DTS, check whether the memory region specified by the DTS CMA node area overlaps with the kernel text memory space reserved by memblock before calling early_init_fdt_scan_reserved_mem. Signed-off-by: Binglei Wang <l3b2w1@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-07-31dma-contiguous: support numa CMA for specified nodeYajun Deng2-26/+84
The kernel parameter 'cma_pernuma=' only supports reserving the same size of CMA area for each node. We need to reserve different sizes of CMA area for specified nodes if these devices belong to different nodes. Adding another kernel parameter 'numa_cma=' to reserve CMA area for the specified node. If we want to use one of these parameters, we need to enable DMA_NUMA_CMA. At the same time, print the node id in cma_declare_contiguous_nid() if CONFIG_NUMA is enabled. Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-07-31dma-contiguous: support per-numa CMA for all architecturesYajun Deng2-4/+10
In the commit b7176c261cdb ("dma-contiguous: provide the ability to reserve per-numa CMA"), Barry adds DMA_PERNUMA_CMA for ARM64. But this feature is architecture independent, so support per-numa CMA for all architectures, and enable it by default if NUMA. Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Tested-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-07-31dma-mapping: move arch_dma_set_mask() declaration to headerArnd Bergmann1-6/+0
This function has a __weak definition and an override that is only used on freescale powerpc chips. The powerpc definition however does not see the declaration that is in a .c file: arch/powerpc/kernel/dma-mask.c:7:6: error: no previous prototype for 'arch_dma_set_mask' [-Werror=missing-prototypes] Move it into the linux/dma-map-ops.h header where the other arch_dma_* functions are declared. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-07-31swiotlb: unexport is_swiotlb_activeChristoph Hellwig1-1/+0
Drivers have no business looking at dma-mapping or swiotlb internals. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Juergen Gross <jgross@suse.com>
2023-07-09Merge tag 'dma-mapping-6.5-2023-07-09' of ↵Linus Torvalds1-11/+35
git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping fixes from Christoph Hellwig: - swiotlb area sizing fixes (Petr Tesarik) * tag 'dma-mapping-6.5-2023-07-09' of git://git.infradead.org/users/hch/dma-mapping: swiotlb: reduce the number of areas to match actual memory pool size swiotlb: always set the number of areas before allocating the pool
2023-06-29Merge tag 'dma-mapping-6.5-2023-06-28' of ↵Linus Torvalds3-3/+14
git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping updates from Christoph Hellwig: - swiotlb cleanups (Petr Tesarik) - use kvmalloc_array (gaoxu) - a small step towards removing is_swiotlb_active (Christoph Hellwig) - fix a Kconfig typo Sui Jingfeng) * tag 'dma-mapping-6.5-2023-06-28' of git://git.infradead.org/users/hch/dma-mapping: drm/nouveau: stop using is_swiotlb_active swiotlb: use the atomic counter of total used slabs if available swiotlb: remove unused field "used" from struct io_tlb_mem dma-remap: use kvmalloc_array/kvfree for larger dma memory remap dma-mapping: fix a Kconfig typo
2023-06-29swiotlb: reduce the number of areas to match actual memory pool sizePetr Tesarik1-3/+24
Although the desired size of the SWIOTLB memory pool is increased in swiotlb_adjust_nareas() to match the number of areas, the actual allocation may be smaller, which may require reducing the number of areas. For example, Xen uses swiotlb_init_late(), which in turn uses the page allocator. On x86, page size is 4 KiB and MAX_ORDER is 10 (1024 pages), resulting in a maximum memory pool size of 4 MiB. This corresponds to 2048 slots of 2 KiB each. The minimum area size is 128 (IO_TLB_SEGSIZE), allowing at most 2048 / 128 = 16 areas. If num_possible_cpus() is greater than the maximum number of areas, areas are smaller than IO_TLB_SEGSIZE and contiguous groups of free slots will span multiple areas. When allocating and freeing slots, only one area will be properly locked, causing race conditions on the unlocked slots and ultimately data corruption, kernel hangs and crashes. Fixes: 20347fca71a3 ("swiotlb: split up the global swiotlb lock") Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com> Reviewed-by: Roberto Sassu <roberto.sassu@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-06-29swiotlb: always set the number of areas before allocating the poolPetr Tesarik1-8/+11
The number of areas defaults to the number of possible CPUs. However, the total number of slots may have to be increased after adjusting the number of areas. Consequently, the number of areas must be determined before allocating the memory pool. This is even explained with a comment in swiotlb_init_remap(), but swiotlb_init_late() adjusts the number of areas after slots are already allocated. The areas may end up being smaller than IO_TLB_SEGSIZE, which breaks per-area locking. While fixing swiotlb_init_late(), move all relevant comments before the definition of swiotlb_adjust_nareas() and convert them to kernel-doc. Fixes: 20347fca71a3 ("swiotlb: split up the global swiotlb lock") Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com> Reviewed-by: Roberto Sassu <roberto.sassu@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-06-19dma-mapping: force bouncing if the kmalloc() size is not cache-line-alignedCatalin Marinas2-1/+6
For direct DMA, if the size is small enough to have originated from a kmalloc() cache below ARCH_DMA_MINALIGN, check its alignment against dma_get_cache_alignment() and bounce if necessary. For larger sizes, it is the responsibility of the DMA API caller to ensure proper alignment. At this point, the kmalloc() caches are properly aligned but this will change in a subsequent patch. Architectures can opt in by selecting DMA_BOUNCE_UNALIGNED_KMALLOC. Link: https://lkml.kernel.org/r/20230612153201.554742-15-catalin.marinas@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Tested-by: Isaac J. Manjarres <isaacmanjarres@google.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Jerry Snitselaar <jsnitsel@redhat.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Jonathan Cameron <jic23@kernel.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Lars-Peter Clausen <lars@metafoo.de> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Mike Snitzer <snitzer@kernel.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19dma-mapping: name SG DMA flag helpers consistentlyRobin Murphy1-1/+1
sg_is_dma_bus_address() is inconsistent with the naming pattern of its corresponding setters and its own kerneldoc, so take the majority vote and rename it sg_dma_is_bus_address() (and fix up the missing underscores in the kerneldoc too). This gives us a nice clear pattern where SG DMA flags are SG_DMA_<NAME>, and the helpers for acting on them are sg_dma_<action>_<name>(). Link: https://lkml.kernel.org/r/20230612153201.554742-14-catalin.marinas@arm.com Signed-off-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Link: https://lore.kernel.org/r/fa2eca2862c7ffc41b50337abffb2dfd2864d3ea.1685036694.git.robin.murphy@arm.com Tested-by: Isaac J. Manjarres <isaacmanjarres@google.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Joerg Roedel <joro@8bytes.org> Cc: Jonathan Cameron <jic23@kernel.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Lars-Peter Clausen <lars@metafoo.de> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Mike Snitzer <snitzer@kernel.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19scatterlist: add dedicated config for DMA flagsRobin Murphy1-0/+3
The DMA flags field will be useful for users beyond PCI P2P, so upgrade to its own dedicated config option. [catalin.marinas@arm.com: use #ifdef CONFIG_NEED_SG_DMA_FLAGS in scatterlist.h] [catalin.marinas@arm.com: update PCI_P2PDMA dma_flags comment in scatterlist.h] Link: https://lkml.kernel.org/r/20230612153201.554742-13-catalin.marinas@arm.com Signed-off-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Isaac J. Manjarres <isaacmanjarres@google.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Jerry Snitselaar <jsnitsel@redhat.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Jonathan Cameron <jic23@kernel.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Lars-Peter Clausen <lars@metafoo.de> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Mike Snitzer <snitzer@kernel.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-07swiotlb: use the atomic counter of total used slabs if availablePetr Tesarik1-0/+11
If DEBUG_FS is enabled, the cost of keeping an exact number of total used slabs is already paid. In this case, there is no reason to use an inexact number for statistics and kernel messages. Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-06-07dma-remap: use kvmalloc_array/kvfree for larger dma memory remapgaoxu1-2/+2
If dma_direct_alloc() alloc memory in size of 64MB, the inner function dma_common_contiguous_remap() will allocate 128KB memory by invoking the function kmalloc_array(). and the kmalloc_array seems to fail to try to allocate 128KB mem. Call trace: [14977.928623] qcrosvm: page allocation failure: order:5, mode:0x40cc0 [14977.928638] dump_backtrace.cfi_jt+0x0/0x8 [14977.928647] dump_stack_lvl+0x80/0xb8 [14977.928652] warn_alloc+0x164/0x200 [14977.928657] __alloc_pages_slowpath+0x9f0/0xb4c [14977.928660] __alloc_pages+0x21c/0x39c [14977.928662] kmalloc_order+0x48/0x108 [14977.928666] kmalloc_order_trace+0x34/0x154 [14977.928668] __kmalloc+0x548/0x7e4 [14977.928673] dma_direct_alloc+0x11c/0x4f8 [14977.928678] dma_alloc_attrs+0xf4/0x138 [14977.928680] gh_vm_ioctl_set_fw_name+0x3c4/0x610 [gunyah] [14977.928698] gh_vm_ioctl+0x90/0x14c [gunyah] [14977.928705] __arm64_sys_ioctl+0x184/0x210 work around by doing kvmalloc_array instead. Signed-off-by: Gao Xu <gaoxu2@hihonor.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-06-07dma-mapping: fix a Kconfig typoSui Jingfeng1-1/+1
'thing' -> 'think' Signed-off-by: Sui Jingfeng <suijingfeng@loongson.cn> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-04-29Merge tag 'dma-mapping-6.4-2023-04-28' of ↵Linus Torvalds5-81/+175
git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping updates from Christoph Hellwig: - fix a PageHighMem check in dma-coherent initialization (Doug Berger) - clean up the coherency defaul initialiation (Jiaxun Yang) - add cacheline to user/kernel dma-debug space dump messages (Desnes Nunes, Geert Uytterhoeve) - swiotlb statistics improvements (Michael Kelley) - misc cleanups (Petr Tesarik) * tag 'dma-mapping-6.4-2023-04-28' of git://git.infradead.org/users/hch/dma-mapping: swiotlb: Omit total_used and used_hiwater if !CONFIG_DEBUG_FS swiotlb: track and report io_tlb_used high water marks in debugfs swiotlb: fix debugfs reporting of reserved memory pools swiotlb: relocate PageHighMem test away from rmem_swiotlb_setup of: address: always use dma_default_coherent for default coherency dma-mapping: provide CONFIG_ARCH_DMA_DEFAULT_COHERENT dma-mapping: provide a fallback dma_default_coherent dma-debug: Use %pa to format phys_addr_t dma-debug: add cacheline to user/kernel space dump messages dma-debug: small dma_debug_entry's comment and variable name updates dma-direct: cleanup parameters to dma_direct_optimal_gfp_mask
2023-04-27Merge tag 'mm-stable-2023-04-27-15-30' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of switching from a user process to a kernel thread. - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav. - zsmalloc performance improvements from Sergey Senozhatsky. - Yue Zhao has found and fixed some data race issues around the alteration of memcg userspace tunables. - VFS rationalizations from Christoph Hellwig: - removal of most of the callers of write_one_page() - make __filemap_get_folio()'s return value more useful - Luis Chamberlain has changed tmpfs so it no longer requires swap backing. Use `mount -o noswap'. - Qi Zheng has made the slab shrinkers operate locklessly, providing some scalability benefits. - Keith Busch has improved dmapool's performance, making part of its operations O(1) rather than O(n). - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd, permitting userspace to wr-protect anon memory unpopulated ptes. - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather than exclusive, and has fixed a bunch of errors which were caused by its unintuitive meaning. - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature, which causes minor faults to install a write-protected pte. - Vlastimil Babka has done some maintenance work on vma_merge(): cleanups to the kernel code and improvements to our userspace test harness. - Cleanups to do_fault_around() by Lorenzo Stoakes. - Mike Rapoport has moved a lot of initialization code out of various mm/ files and into mm/mm_init.c. - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for DRM, but DRM doesn't use it any more. - Lorenzo has also coverted read_kcore() and vread() to use iterators and has thereby removed the use of bounce buffers in some cases. - Lorenzo has also contributed further cleanups of vma_merge(). - Chaitanya Prakash provides some fixes to the mmap selftesting code. - Matthew Wilcox changes xfs and afs so they no longer take sleeping locks in ->map_page(), a step towards RCUification of pagefaults. - Suren Baghdasaryan has improved mmap_lock scalability by switching to per-VMA locking. - Frederic Weisbecker has reworked the percpu cache draining so that it no longer causes latency glitches on cpu isolated workloads. - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig logic. - Liu Shixin has changed zswap's initialization so we no longer waste a chunk of memory if zswap is not being used. - Yosry Ahmed has improved the performance of memcg statistics flushing. - David Stevens has fixed several issues involving khugepaged, userfaultfd and shmem. - Christoph Hellwig has provided some cleanup work to zram's IO-related code paths. - David Hildenbrand has fixed up some issues in the selftest code's testing of our pte state changing. - Pankaj Raghav has made page_endio() unneeded and has removed it. - Peter Xu contributed some rationalizations of the userfaultfd selftests. - Yosry Ahmed has fixed an issue around memcg's page recalim accounting. - Chaitanya Prakash has fixed some arm-related issues in the selftests/mm code. - Longlong Xia has improved the way in which KSM handles hwpoisoned pages. - Peter Xu fixes a few issues with uffd-wp at fork() time. - Stefan Roesch has changed KSM so that it may now be used on a per-process and per-cgroup basis. * tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits) mm,unmap: avoid flushing TLB in batch if PTE is inaccessible shmem: restrict noswap option to initial user namespace mm/khugepaged: fix conflicting mods to collapse_file() sparse: remove unnecessary 0 values from rc mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area() hugetlb: pte_alloc_huge() to replace huge pte_alloc_map() maple_tree: fix allocation in mas_sparse_area() mm: do not increment pgfault stats when page fault handler retries zsmalloc: allow only one active pool compaction context selftests/mm: add new selftests for KSM mm: add new KSM process and sysfs knobs mm: add new api to enable ksm per process mm: shrinkers: fix debugfs file permissions mm: don't check VMA write permissions if the PTE/PMD indicates write permissions migrate_pages_batch: fix statistics for longterm pin retry userfaultfd: use helper function range_in_vma() lib/show_mem.c: use for_each_populated_zone() simplify code mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list() fs/buffer: convert create_page_buffers to folio_create_buffers fs/buffer: add folio_create_empty_buffers helper ...
2023-04-27Merge tag 'hyperv-next-signed-20230424' of ↵Linus Torvalds1-44/+1
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv updates from Wei Liu: - PCI passthrough for Hyper-V confidential VMs (Michael Kelley) - Hyper-V VTL mode support (Saurabh Sengar) - Move panic report initialization code earlier (Long Li) - Various improvements and bug fixes (Dexuan Cui and Michael Kelley) * tag 'hyperv-next-signed-20230424' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (22 commits) PCI: hv: Replace retarget_msi_interrupt_params with hyperv_pcpu_input_arg Drivers: hv: move panic report code from vmbus to hv early init code x86/hyperv: VTL support for Hyper-V Drivers: hv: Kconfig: Add HYPERV_VTL_MODE x86/hyperv: Make hv_get_nmi_reason public x86/hyperv: Add VTL specific structs and hypercalls x86/init: Make get/set_rtc_noop() public x86/hyperv: Exclude lazy TLB mode CPUs from enlightened TLB flushes x86/hyperv: Add callback filter to cpumask_to_vpset() Drivers: hv: vmbus: Remove the per-CPU post_msg_page clocksource: hyper-v: make sure Invariant-TSC is used if it is available PCI: hv: Enable PCI pass-thru devices in Confidential VMs Drivers: hv: Don't remap addresses that are above shared_gpa_boundary hv_netvsc: Remove second mapping of send and recv buffers Drivers: hv: vmbus: Remove second way of mapping ring buffers Drivers: hv: vmbus: Remove second mapping of VMBus monitor pages swiotlb: Remove bounce buffer remapping for Hyper-V Driver: VMBus: Add Devicetree support dt-bindings: bus: Add Hyper-V VMBus Drivers: hv: vmbus: Convert acpi_device to more generic platform_device ...
2023-04-27Merge tag 'modules-6.4-rc1' of ↵Linus Torvalds1-1/+0
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull module updates from Luis Chamberlain: "The summary of the changes for this pull requests is: - Song Liu's new struct module_memory replacement - Nick Alcock's MODULE_LICENSE() removal for non-modules - My cleanups and enhancements to reduce the areas where we vmalloc module memory for duplicates, and the respective debug code which proves the remaining vmalloc pressure comes from userspace. Most of the changes have been in linux-next for quite some time except the minor fixes I made to check if a module was already loaded prior to allocating the final module memory with vmalloc and the respective debug code it introduces to help clarify the issue. Although the functional change is small it is rather safe as it can only *help* reduce vmalloc space for duplicates and is confirmed to fix a bootup issue with over 400 CPUs with KASAN enabled. I don't expect stable kernels to pick up that fix as the cleanups would have also had to have been picked up. Folks on larger CPU systems with modules will want to just upgrade if vmalloc space has been an issue on bootup. Given the size of this request, here's some more elaborate details: The functional change change in this pull request is the very first patch from Song Liu which replaces the 'struct module_layout' with a new 'struct module_memory'. The old data structure tried to put together all types of supported module memory types in one data structure, the new one abstracts the differences in memory types in a module to allow each one to provide their own set of details. This paves the way in the future so we can deal with them in a cleaner way. If you look at changes they also provide a nice cleanup of how we handle these different memory areas in a module. This change has been in linux-next since before the merge window opened for v6.3 so to provide more than a full kernel cycle of testing. It's a good thing as quite a bit of fixes have been found for it. Jason Baron then made dynamic debug a first class citizen module user by using module notifier callbacks to allocate / remove module specific dynamic debug information. Nick Alcock has done quite a bit of work cross-tree to remove module license tags from things which cannot possibly be module at my request so to: a) help him with his longer term tooling goals which require a deterministic evaluation if a piece a symbol code could ever be part of a module or not. But quite recently it is has been made clear that tooling is not the only one that would benefit. Disambiguating symbols also helps efforts such as live patching, kprobes and BPF, but for other reasons and R&D on this area is active with no clear solution in sight. b) help us inch closer to the now generally accepted long term goal of automating all the MODULE_LICENSE() tags from SPDX license tags In so far as a) is concerned, although module license tags are a no-op for non-modules, tools which would want create a mapping of possible modules can only rely on the module license tag after the commit 8b41fc4454e ("kbuild: create modules.builtin without Makefile.modbuiltin or tristate.conf"). Nick has been working on this *for years* and AFAICT I was the only one to suggest two alternatives to this approach for tooling. The complexity in one of my suggested approaches lies in that we'd need a possible-obj-m and a could-be-module which would check if the object being built is part of any kconfig build which could ever lead to it being part of a module, and if so define a new define -DPOSSIBLE_MODULE [0]. A more obvious yet theoretical approach I've suggested would be to have a tristate in kconfig imply the same new -DPOSSIBLE_MODULE as well but that means getting kconfig symbol names mapping to modules always, and I don't think that's the case today. I am not aware of Nick or anyone exploring either of these options. Quite recently Josh Poimboeuf has pointed out that live patching, kprobes and BPF would benefit from resolving some part of the disambiguation as well but for other reasons. The function granularity KASLR (fgkaslr) patches were mentioned but Joe Lawrence has clarified this effort has been dropped with no clear solution in sight [1]. In the meantime removing module license tags from code which could never be modules is welcomed for both objectives mentioned above. Some developers have also welcomed these changes as it has helped clarify when a module was never possible and they forgot to clean this up, and so you'll see quite a bit of Nick's patches in other pull requests for this merge window. I just picked up the stragglers after rc3. LWN has good coverage on the motivation behind this work [2] and the typical cross-tree issues he ran into along the way. The only concrete blocker issue he ran into was that we should not remove the MODULE_LICENSE() tags from files which have no SPDX tags yet, even if they can never be modules. Nick ended up giving up on his efforts due to having to do this vetting and backlash he ran into from folks who really did *not understand* the core of the issue nor were providing any alternative / guidance. I've gone through his changes and dropped the patches which dropped the module license tags where an SPDX license tag was missing, it only consisted of 11 drivers. To see if a pull request deals with a file which lacks SPDX tags you can just use: ./scripts/spdxcheck.py -f \ $(git diff --name-only commid-id | xargs echo) You'll see a core module file in this pull request for the above, but that's not related to his changes. WE just need to add the SPDX license tag for the kernel/module/kmod.c file in the future but it demonstrates the effectiveness of the script. Most of Nick's changes were spread out through different trees, and I just picked up the slack after rc3 for the last kernel was out. Those changes have been in linux-next for over two weeks. The cleanups, debug code I added and final fix I added for modules were motivated by David Hildenbrand's report of boot failing on a systems with over 400 CPUs when KASAN was enabled due to running out of virtual memory space. Although the functional change only consists of 3 lines in the patch "module: avoid allocation if module is already present and ready", proving that this was the best we can do on the modules side took quite a bit of effort and new debug code. The initial cleanups I did on the modules side of things has been in linux-next since around rc3 of the last kernel, the actual final fix for and debug code however have only been in linux-next for about a week or so but I think it is worth getting that code in for this merge window as it does help fix / prove / evaluate the issues reported with larger number of CPUs. Userspace is not yet fixed as it is taking a bit of time for folks to understand the crux of the issue and find a proper resolution. Worst come to worst, I have a kludge-of-concept [3] of how to make kernel_read*() calls for modules unique / converge them, but I'm currently inclined to just see if userspace can fix this instead" Link: https://lore.kernel.org/all/Y/kXDqW+7d71C4wz@bombadil.infradead.org/ [0] Link: https://lkml.kernel.org/r/025f2151-ce7c-5630-9b90-98742c97ac65@redhat.com [1] Link: https://lwn.net/Articles/927569/ [2] Link: https://lkml.kernel.org/r/20230414052840.1994456-3-mcgrof@kernel.org [3] * tag 'modules-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (121 commits) module: add debugging auto-load duplicate module support module: stats: fix invalid_mod_bytes typo module: remove use of uninitialized variable len module: fix building stats for 32-bit targets module: stats: include uapi/linux/module.h module: avoid allocation if module is already present and ready module: add debug stats to help identify memory pressure module: extract patient module check into helper modules/kmod: replace implementation with a semaphore Change DEFINE_SEMAPHORE() to take a number argument module: fix kmemleak annotations for non init ELF sections module: Ignore L0 and rename is_arm_mapping_symbol() module: Move is_arm_mapping_symbol() to module_symbol.h module: Sync code of is_arm_mapping_symbol() scripts/gdb: use mem instead of core_layout to get the module address interconnect: remove module-related code interconnect: remove MODULE_LICENSE in non-modules zswap: remove MODULE_LICENSE in non-modules zpool: remove MODULE_LICENSE in non-modules x86/mm/dump_pagetables: remove MODULE_LICENSE in non-modules ...
2023-04-20swiotlb: Omit total_used and used_hiwater if !CONFIG_DEBUG_FSPetr Tesarik1-3/+12
The tracking of used_hiwater adds an atomic operation to the hot path. This is acceptable only when debugging the kernel. To make sure that the fields can never be used by mistake, do not even include them in struct io_tlb_mem if CONFIG_DEBUG_FS is not set. The build fails after doing that. To fix it, it is necessary to remove all code specific to debugfs and instead provide a stub implementation of swiotlb_create_debugfs_files(). As a bonus, this change allows to remove one __maybe_unused attribute. Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-04-17swiotlb: Remove bounce buffer remapping for Hyper-VMichael Kelley1-44/+1
With changes to how Hyper-V guest VMs flip memory between private (encrypted) and shared (decrypted), creating a second kernel virtual mapping for shared memory is no longer necessary. Everything needed for the transition to shared is handled by set_memory_decrypted(). As such, remove swiotlb_unencrypted_base and the associated code. Signed-off-by: Michael Kelley <mikelley@microsoft.com> Acked-by: Christoph Hellwig <hch@lst.de> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/1679838727-87310-8-git-send-email-mikelley@microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org>