path: root/drivers/vfio/pci
11 days  module: Convert symbol namespace to string literal  [Peter Zijlstra, 3 files, -3/+3]
Clean up the existing export namespace code along the same lines of commit 33def8498fdd ("treewide: Convert macro and uses of __section(foo) to __section("foo")") and for the same reason: it is not desired for the namespace argument to be a macro expansion itself.

Scripted using:

  git grep -l -e MODULE_IMPORT_NS -e EXPORT_SYMBOL_NS | while read file; do
    awk -i inplace '
      /^#define EXPORT_SYMBOL_NS/ { gsub(/__stringify\(ns\)/, "ns"); print; next; }
      /^#define MODULE_IMPORT_NS/ { gsub(/__stringify\(ns\)/, "ns"); print; next; }
      /MODULE_IMPORT_NS/ {
        $0 = gensub(/MODULE_IMPORT_NS\(([^)]*)\)/, "MODULE_IMPORT_NS(\"\\1\")", "g");
      }
      /EXPORT_SYMBOL_NS/ {
        if ($0 ~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+),/) {
          if ($0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/ &&
              $0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(\)/ &&
              $0 !~ /^my/) {
            getline line;
            gsub(/[[:space:]]*\\$/, "");
            gsub(/[[:space:]]/, "", line);
            $0 = $0 " " line;
          }
          $0 = gensub(/(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/, "\\1(\\2, \"\\3\")", "g");
        }
      }
      { print }' $file;
  done

Requested-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://mail.google.com/mail/u/2/#inbox/FMfcgzQXKWgMmjdFwwdsfgxzKpVHWPlc
Acked-by: Greg KH <gregkh@linuxfoundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
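The effect of the conversion on a driver source file looks roughly like this (a hand-written illustration of the script's output, not taken from the patch):

  /* Before: the namespace is a bare token that the macro stringified. */
  MODULE_IMPORT_NS(DMA_BUF);
  EXPORT_SYMBOL_NS_GPL(vfio_pci_core_enable, VFIO_PCI_CORE);

  /* After: the namespace is an ordinary string literal, so it can no
   * longer be altered by an unrelated macro expansion. */
  MODULE_IMPORT_NS("DMA_BUF");
  EXPORT_SYMBOL_NS_GPL(vfio_pci_core_enable, "VFIO_PCI_CORE");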
2024-11-25  vfio/pci: Properly hide first-in-list PCIe extended capability  [Avihai Horon, 1 file, -2/+14]
There are cases where a PCIe extended capability should be hidden from the user. For example, an unknown capability (i.e., capability with ID greater than PCI_EXT_CAP_ID_MAX) or a capability that is intentionally chosen to be hidden from the user. Hiding a capability is done by virtualizing and modifying the 'Next Capability Offset' field of the previous capability so it points to the capability after the one that should be hidden. The special case where the first capability in the list should be hidden is handled differently because there is no previous capability that can be modified. In this case, the capability ID and version are zeroed while leaving the next pointer intact. This hides the capability and leaves an anchor for the rest of the capability list. However, today, hiding the first capability in the list is not done properly if the capability is unknown, as struct vfio_pci_core_device->pci_config_map is set to the capability ID during initialization but the capability ID is not properly checked later when used in vfio_config_do_rw(). This leads to the following warning [1] and to an out-of-bounds access to the ecap_perms array. Fix it by checking cap_id in vfio_config_do_rw(), and if it is greater than PCI_EXT_CAP_ID_MAX, use an alternative struct perm_bits for direct read only access instead of the ecap_perms array. Note that this is safe since the above is the only case where cap_id can exceed PCI_EXT_CAP_ID_MAX (except for the special capabilities, which are already checked before).

[1]
  WARNING: CPU: 118 PID: 5329 at drivers/vfio/pci/vfio_pci_config.c:1900 vfio_pci_config_rw+0x395/0x430 [vfio_pci_core]
  CPU: 118 UID: 0 PID: 5329 Comm: simx-qemu-syste Not tainted 6.12.0+ #1
  (snip)
  Call Trace:
   <TASK>
   ? show_regs+0x69/0x80
   ? __warn+0x8d/0x140
   ? vfio_pci_config_rw+0x395/0x430 [vfio_pci_core]
   ? report_bug+0x18f/0x1a0
   ? handle_bug+0x63/0xa0
   ? exc_invalid_op+0x19/0x70
   ? asm_exc_invalid_op+0x1b/0x20
   ? vfio_pci_config_rw+0x395/0x430 [vfio_pci_core]
   ? vfio_pci_config_rw+0x244/0x430 [vfio_pci_core]
   vfio_pci_rw+0x101/0x1b0 [vfio_pci_core]
   vfio_pci_core_read+0x1d/0x30 [vfio_pci_core]
   vfio_device_fops_read+0x27/0x40 [vfio]
   vfs_read+0xbd/0x340
   ? vfio_device_fops_unl_ioctl+0xbb/0x740 [vfio]
   ? __rseq_handle_notify_resume+0xa4/0x4b0
   __x64_sys_pread64+0x96/0xc0
   x64_sys_call+0x1c3d/0x20d0
   do_syscall_64+0x4d/0x120
   entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver")
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20241124142739.21698-1-avihaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
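The shape of the fix is roughly the following (a sketch, not the verbatim patch; the fallback perm_bits name is illustrative):

  /* Read-only fallback for hidden/unknown extended capabilities. */
  static const struct perm_bits direct_ro_perms = {
          .readfn = vfio_direct_config_read,
  };

  /* In vfio_config_do_rw(): an ID beyond PCI_EXT_CAP_ID_MAX must not
   * index ecap_perms[]; fall back to direct read-only access. */
  if (cap_id > PCI_EXT_CAP_ID_MAX)
          perm = &direct_ro_perms;
  else
          perm = &ecap_perms[cap_id];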
2024-11-14  vfio/mlx5: Fix unwind flows in mlx5vf_pci_save/resume_device_data()  [Yishai Hadas, 1 file, -18/+17]
Fix unwind flows in mlx5vf_pci_save_device_data() and mlx5vf_pci_resume_device_data() to avoid freeing the migf pointer at the 'end' label, as this will be handled by fput(migf->filp) through mlx5vf_release_file(). To ensure mlx5vf_release_file() functions correctly, move the initialization of migf fields (such as migf->lock) to occur before any potential unwind flow, as these fields may be accessed within mlx5vf_release_file(). Fixes: 9945a67ea4b3 ("vfio/mlx5: Refactor PD usage") Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20241114095318.16556-3-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
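A minimal sketch of the resulting ordering (assumed structure, simplified from the driver; the fops name is illustrative):

  struct mlx5_vf_migration_file *migf = kzalloc(sizeof(*migf), GFP_KERNEL);
  if (!migf)
          return ERR_PTR(-ENOMEM);

  migf->filp = anon_inode_getfile("mlx5vf_mig", &mlx5vf_save_fops, migf,
                                  O_RDONLY);
  if (IS_ERR(migf->filp)) {
          ret = PTR_ERR(migf->filp);
          kfree(migf);            /* no file exists yet, plain free is fine */
          return ERR_PTR(ret);
  }

  /* Initialize everything mlx5vf_release_file() touches *before* the
   * first failure point below, since the release path may now run. */
  mutex_init(&migf->lock);
  ...
  err:
          fput(migf->filp);       /* release callback frees migf, not us */
          return ERR_PTR(ret);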
2024-11-14  vfio/mlx5: Fix an unwind issue in mlx5vf_add_migration_pages()  [Yishai Hadas, 1 file, -1/+5]
Fix an unwind issue in mlx5vf_add_migration_pages(). If a set of pages is allocated but fails to be added to the SG table, they need to be freed to prevent a memory leak. Any pages successfully added to the SG table will be freed as part of mlx5vf_free_data_buffer(). Fixes: 6fadb021266d ("vfio/mlx5: Implement vfio_pci driver for mlx5 devices") Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20241114095318.16556-2-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
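The unwind being fixed, sketched (simplified; names only loosely follow the driver):

  for (i = 0; i < npages; i++) {
          page_list[i] = alloc_page(GFP_KERNEL);
          if (!page_list[i])
                  goto err;
  }

  ret = sg_alloc_append_table_from_pages(&buf->table, page_list, npages,
                                         0, npages << PAGE_SHIFT, UINT_MAX,
                                         SG_MAX_SINGLE_ALLOC, GFP_KERNEL);
  if (ret)
          goto err;       /* pages are NOT in the table yet */
  ...
  err:
          /* The fix: free the batch that never made it into the SG table;
           * pages already in the table are freed by mlx5vf_free_data_buffer(). */
          while (i--)
                  __free_page(page_list[i]);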
2024-11-13  vfio/virtio: Enable live migration once VIRTIO_PCI was configured  [Yishai Hadas, 5 files, -401/+489]
Now that the driver supports live migration, only the legacy IO functionality depends on config VIRTIO_PCI_ADMIN_LEGACY. As part of that, introduce a bool configuration option named VIRTIO_VFIO_PCI_ADMIN_LEGACY, as a sub-menu under the driver's main live migration feature, to control the legacy IO functionality. This lets users configuring the kernel know which of the features from the description will be available in the resulting driver. Accordingly, move the legacy IO into a separate file that is compiled only when CONFIG_VIRTIO_VFIO_PCI_ADMIN_LEGACY is configured, and let live migration depend only on VIRTIO_PCI. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20241113115200.209269-8-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
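The resulting option structure, as a hedged sketch of the Kconfig (help text and exact wording approximated):

  config VIRTIO_VFIO_PCI
          tristate "VFIO support for VIRTIO NET PCI VF devices"
          depends on VIRTIO_PCI
          select VFIO_PCI_CORE
          help
            Live migration support for VIRTIO NET PCI VF devices.

  config VIRTIO_VFIO_PCI_ADMIN_LEGACY
          bool "Legacy I/O support for VIRTIO NET PCI VF devices"
          depends on VIRTIO_VFIO_PCI && VIRTIO_PCI_ADMIN_LEGACY
          default y
          help
            Legacy I/O support for VIRTIO NET PCI VF devices.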
2024-11-13  vfio/virtio: Add PRE_COPY support for live migration  [Yishai Hadas, 2 files, -8/+227]
Add PRE_COPY support for live migration. This functionality may reduce the downtime upon STOP_COPY by letting the target machine get some 'initial data' from the source while the source is still in its RUNNING state, and prepare ahead for receiving the final STOP_COPY data. The Virtio specification does not support reading partial or incremental device contexts, so during the PRE_COPY state the vfio-virtio driver reads the full device state. As the device state can change, and the benefit is highest when the pre-copy data closely matches the final data, we read it in a rate-limited mode: we avoid reading new data from the device for a specified time interval after the last read. With PRE_COPY enabled, we observed a downtime reduction of approximately 70-75% in various scenarios compared to when PRE_COPY was disabled, while keeping the total migration time nearly the same. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20241113115200.209269-7-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
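The rate limiter can be pictured as follows (hypothetical names and interval value; the driver's actual constant and field names may differ):

  #define PRE_COPY_READ_INTERVAL_MS  500  /* assumed value, for illustration */

  /* Only issue a fresh full-device-context read if enough time has
   * passed since the previous one completed. */
  static bool virtiovf_pre_copy_read_allowed(struct virtiovf_migration_file *migf)
  {
          return ktime_ms_delta(ktime_get(), migf->last_read_time) >=
                 PRE_COPY_READ_INTERVAL_MS;
  }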
2024-11-13  vfio/virtio: Add support for the basic live migration functionality  [Yishai Hadas, 4 files, -25/+1285]
Add support for basic live migration functionality in VFIO over virtio-net devices, aligned with the virtio device specification 1.4. This includes the following VFIO features: VFIO_MIGRATION_STOP_COPY, VFIO_MIGRATION_P2P. The implementation registers with the VFIO subsystem using vfio_pci_core and then incorporates the virtio-specific logic for the migration process. The migration follows the definitions in uapi/vfio.h and leverages the virtio VF-to-PF admin queue command channel to execute device-parts related commands. Additional notes: The kernel protocol between the source and target devices contains a header with metadata, including record size, tag, and flags. The record size allows the target to recognize and read a complete image from the source before passing the device part data. This adheres to the virtio device specification, which mandates that partial device parts cannot be supplied. The tag and flags serve as placeholders for future extensions of the kernel protocol between the source and target, ensuring backward and forward compatibility. Both the source and target comply with the virtio device specification by using a device part object with a unique ID as part of the migration process. Since this resource is limited to a maximum of 255, its lifecycle is confined to periods with an active live migration flow. According to the virtio specification, a device has only two modes: RUNNING and STOPPED. As a result, certain VFIO transitions (i.e., RUNNING_P2P->STOP, STOP->RUNNING_P2P) are treated as no-ops. When transitioning to RUNNING_P2P, the device state is set to STOP, and it will remain STOPPED until the transition out of RUNNING_P2P->RUNNING, at which point it returns to RUNNING. During the transition to STOP, the virtio device only stops initiating outgoing requests (e.g., DMA, MSI-X) but must still accept incoming operations. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20241113115200.209269-6-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
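The record header described above has roughly this shape (a sketch; field names are illustrative, not the driver's exact layout):

  struct virtiovf_migration_header {
          __le64 record_size;     /* size of the device-part image that follows */
          __le32 flags;           /* placeholder for forward/backward compat */
          __le32 tag;             /* placeholder record type */
          __u8   data[];          /* complete virtio device-part image */
  };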
2024-11-13  hisi_acc_vfio_pci: register debugfs for hisilicon migration driver  [Longfang Liu, 2 files, -0/+210]
On the debugfs framework of VFIO, if the CONFIG_VFIO_DEBUGFS macro is enabled, the debug function is registered for the live migration driver of the HiSilicon accelerator device. After registering the HiSilicon accelerator device on the debugfs framework of live migration of vfio, a debugfs directory "hisi_acc" is created, and then three debug function files are created in this directory:

  vfio
  |
  +---<dev_name1>
  |       +---migration
  |               +--state
  |               +--hisi_acc
  |                       +--dev_data
  |                       +--migf_data
  |                       +--cmd_state
  |
  +---<dev_name2>
          +---migration
                  +--state
                  +--hisi_acc
                          +--dev_data
                          +--migf_data
                          +--cmd_state

dev_data file: read device data that needs to be migrated from the current device in real time.
migf_data file: read the migration data of the last live migration from the current driver.
cmd_state: used to get the cmd channel state for the device.

  +----------------+      +--------------+     +---------------+
  | migration dev  |      |   src dev    |     |    dst dev    |
  +-------+--------+      +------+-------+     +-------+-------+
          |                      |                     |
          |               +------v-------+     +-------v-------+
          |               | saving_migf  |     | resuming_migf |
     read |               |    file      |     |     file      |
          |               +------+-------+     +-------+-------+
          |                      | copy                |
          |                      +------------+--------+
          |                                   |
  +-------v--------+                  +-------v--------+
  |  data buffer   |                  |   debug_migf   |
  +-------+--------+                  +-------+--------+
          |                                   |
      cat |                               cat |
  +-------v--------+                  +-------v--------+
  |    dev_data    |                  |   migf_data    |
  +----------------+                  +----------------+

When accessing debugfs, users can obtain the most recent status data of the device through the "dev_data" file. It can read recent complete status data of the device. If the current device is being migrated, it will wait for that to complete. The data for the last completed migration function will be stored in debug_migf. Users can read it via "migf_data".

Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20241112073322.54550-4-liulongfang@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
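Creating the files amounts to something like this (a sketch assuming the vfio core already provides the per-device "migration" debugfs directory; the fops names are illustrative):

  struct dentry *hisi_dir =
          debugfs_create_dir("hisi_acc", vdev->debug_root);

  debugfs_create_file("dev_data", 0444, hisi_dir, hisi_acc_vdev,
                      &hisi_acc_vf_dev_fops);
  debugfs_create_file("migf_data", 0444, hisi_dir, hisi_acc_vdev,
                      &hisi_acc_vf_migf_fops);
  debugfs_create_file("cmd_state", 0444, hisi_dir, hisi_acc_vdev,
                      &hisi_acc_vf_cmd_fops);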
2024-11-12  hisi_acc_vfio_pci: create subfunction for data reading  [Longfang Liu, 1 file, -21/+33]
Factor the code that reads data from the device out into a subfunction. It can then be called both during the device status data saving phase of the live migration process and by the device status data reading function in debugfs, reducing redundant code in the driver. Signed-off-by: Longfang Liu <liulongfang@huawei.com> Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Link: https://lore.kernel.org/r/20241112073322.54550-3-liulongfang@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-11-12  hisi_acc_vfio_pci: extract public functions for container_of  [Longfang Liu, 1 file, -10/+11]
In the current driver, the struct hisi_acc_vf_core_device is obtained from the vfio_device (vdev) through the container_of function. This pattern is used in many places in the driver. In order to reduce this repetitive operation, extract it into a common helper function. Signed-off-by: Longfang Liu <liulongfang@huawei.com> Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Link: https://lore.kernel.org/r/20241112073322.54550-2-liulongfang@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
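The extracted helper is essentially a container_of() wrapper (a sketch; the member path assumes the vfio_device is embedded inside the driver's core_device):

  static struct hisi_acc_vf_core_device *
  hisi_acc_get_vf_dev(struct vfio_device *vdev)
  {
          return container_of(vdev, struct hisi_acc_vf_core_device,
                              core_device.vdev);
  }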
2024-10-30  vfio/qat: fix overflow check in qat_vf_resume_write()  [Giovanni Cabiddu, 1 file, -1/+1]
The unsigned variable `size_t len` is cast to the signed type `loff_t` when passed to the function check_add_overflow(). This function considers the type of the destination, which is of type loff_t (signed), potentially leading to an overflow. This issue is similar to the one described in the link below. Remove the cast. Note that even if check_add_overflow() is bypassed, by setting `len` to a value that is greater than LONG_MAX (which is considered as a negative value after the cast), the function copy_from_user(), invoked a few lines later, will not perform any copy and return `len` as (len > INT_MAX) causing qat_vf_resume_write() to fail with -EFAULT. Fixes: bb208810b1ab ("vfio/qat: Add vfio_pci driver for Intel QAT SR-IOV VF devices") CC: stable@vger.kernel.org # 6.10+ Link: https://lore.kernel.org/all/138bd2e2-ede8-4bcc-aa7b-f3d9de167a37@moroto.mountain Reported-by: Zijie Zhao <zzjas98@gmail.com> Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com> Reviewed-by: Xin Zeng <xin.zeng@intel.com> Link: https://lore.kernel.org/r/20241021123843.42979-1-giovanni.cabiddu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
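The before/after of the check (a sketch; variable names approximate the driver's):

  size_t len;     /* request length from the write() */
  loff_t end;

  /* Before: the cast can wrap a huge size_t to a negative loff_t before
   * check_add_overflow() ever sees it, defeating the check. */
  if (check_add_overflow((loff_t)len, *offs, &end))
          return -EOVERFLOW;

  /* After: let check_add_overflow() reason about the unsigned source and
   * the signed loff_t destination itself. */
  if (check_add_overflow(len, *offs, &end))
          return -EOVERFLOW;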
2024-10-30  vfio/nvgrace-gpu: Add a new GH200 SKU to the devid table  [Ankit Agrawal, 1 file, -0/+2]
NVIDIA is planning to productize a new Grace Hopper superchip SKU with device ID 0x2348. Add the SKU devid to nvgrace_gpu_vfio_pci_table. Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20241013075216.19229-1-ankita@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
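In table form, the addition looks like the following (a sketch; surrounding entries elided):

  static const struct pci_device_id nvgrace_gpu_vfio_pci_table[] = {
          ...
          /* New GH200 SKU with device ID 0x2348 */
          { PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_NVIDIA, 0x2348) },
          {}
  };
  MODULE_DEVICE_TABLE(pci, nvgrace_gpu_vfio_pci_table);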
2024-09-27  [tree-wide] finally take no_llseek out  [Al Viro, 4 files, -8/+0]
no_llseek had been defined to NULL two years ago, in commit 868941b14441 ("fs: remove no_llseek"). To quote that commit,

  At -rc1 we'll need do a mechanical removal of no_llseek -

  git grep -l -w no_llseek | grep -v porting.rst | while read i; do
          sed -i '/\<no_llseek\>/d' $i
  done

  would do it.

Unfortunately, that hadn't been done. Linus, could you do that now, so that we could finally put that thing to rest? All instances are of the form .llseek = no_llseek, so it's obviously safe.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-09-24  Merge tag 'vfio-v6.12-rc1' of https://github.com/awilliam/linux-vfio  [Linus Torvalds, 1 file, -6/+1]
Pull VFIO updates from Alex Williamson:
 "Just a few cleanups this cycle:

  - Remove several unused structure and function declarations, and unused variables (Dr. David Alan Gilbert, Yue Haibing, Zhang Zekun)

  - Constify unmodified structure in mdev (Hongbo Li)

  - Convert to unsigned type to catch overflow with less fanfare than passing a negative value to kcalloc() (Dan Carpenter)"

* tag 'vfio-v6.12-rc1' of https://github.com/awilliam/linux-vfio:
  vfio/pci: clean up a type in vfio_pci_ioctl_pci_hot_reset_groups()
  vfio/mdev: Constify struct kobj_type
  vfio: mdev: Remove unused function declarations
  vfio/fsl-mc: Remove unused variable 'hwirq'
  vfio/pci: Remove unused struct 'vfio_pci_mmap_vma'
2024-09-17  vfio/pci: implement huge_fault support  [Alex Williamson, 1 file, -17/+43]
With the addition of pfnmap support in vmf_insert_pfn_{pmd,pud}() we can take advantage of PMD and PUD faults to PCI BAR mmaps and create more efficient mappings. PCI BARs are always a power of two and will typically get at least PMD alignment without userspace even trying. Userspace alignment for PUD mappings is also not too difficult. Consolidate faults through a single handler with a new wrapper for standard single page faults. The pre-faulting behavior of commit d71a989cf5d9 ("vfio/pci: Insert full vma on mmap'd MMIO fault") is removed in this refactoring since huge_fault will cover the bulk of the faults and results in more efficient page table usage. We also want to avoid pre-faulted single page mappings preempting huge page mappings. Link: https://lkml.kernel.org/r/20240826204353.2228736-20-peterx@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
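A heavily condensed sketch of the consolidated handler (locking, range checks and the PUD case omitted; vma_to_pfn() is assumed to be the driver's existing BAR-pfn helper):

  static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
                                             unsigned int order)
  {
          struct vm_area_struct *vma = vmf->vma;
          unsigned long pfn = vma_to_pfn(vma) + (vmf->pgoff - vma->vm_pgoff);

          switch (order) {
          case 0:
                  return vmf_insert_pfn(vma, vmf->address, pfn);
          case PMD_ORDER: /* only if address/pfn are suitably aligned */
                  return vmf_insert_pfn_pmd(vmf,
                                            __pfn_to_pfn_t(pfn, PFN_DEV),
                                            false);
          default:
                  return VM_FAULT_FALLBACK;
          }
  }

  /* New wrapper: a standard single-page fault is just order-0. */
  static vm_fault_t vfio_pci_mmap_page_fault(struct vm_fault *vmf)
  {
          return vfio_pci_mmap_huge_fault(vmf, 0);
  }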
2024-09-12  vfio/pci: clean up a type in vfio_pci_ioctl_pci_hot_reset_groups()  [Dan Carpenter, 1 file, -1/+1]
The "array_count" value comes from the copy_from_user() in vfio_pci_ioctl_pci_hot_reset(). If the user passes a value larger than INT_MAX then we'll pass a negative value to kcalloc() which triggers an allocation failure and a stack trace. It's better to make the type unsigned so that if (array_count > count) returns -EINVAL instead. Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/262ada03-d848-4369-9c37-81edeeed2da2@stanley.mountain Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-09-03  vfio/pci: Remove unused struct 'vfio_pci_mmap_vma'  [Dr. David Alan Gilbert, 1 file, -5/+0]
'vfio_pci_mmap_vma' has been unused since commit aac6db75a9fc ("vfio/pci: Use unmap_mapping_range()") Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20240727160307.1000476-1-linux@treblig.org Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-07-19  Merge tag 'vfio-v6.11-rc1' of https://github.com/awilliam/linux-vfio  [Linus Torvalds, 1 file, -60/+62]
Pull VFIO updates from Alex Williamson:

 - Add support for 8-byte accesses when using read/write through the device regions. This fills a gap for userspace drivers that might not be able to use access through mmap to perform native register width accesses (Gerd Bayer)

 - Add missing MODULE_DESCRIPTION to vfio-mdev sample drivers and replace a non-standard MODULE_INFO usage (Jeff Johnson)

* tag 'vfio-v6.11-rc1' of https://github.com/awilliam/linux-vfio:
  vfio-mdev: add missing MODULE_DESCRIPTION() macros
  vfio/pci: Fix typo in macro to declare accessors
  vfio/pci: Support 8-byte PCI loads and stores
  vfio/pci: Extract duplicated code into macro
2024-07-10  vfio/pci: Init the count variable in collecting hot-reset devices  [Yi Liu, 1 file, -1/+1]
The count variable is used without initialization, which results in mistakes in the device counting and crashes userspace if the get hot-reset info path is triggered. Fixes: f6944d4a0b87 ("vfio/pci: Collect hot-reset devices to local buffer") Link: https://bugzilla.kernel.org/show_bug.cgi?id=219010 Reported-by: Žilvinas Žaltiena <zaltys@natrix.lt> Cc: Beld Zhang <beldzhang@gmail.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20240710004150.319105-1-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-06-21  vfio/pci: Support 8-byte PCI loads and stores  [Ben Segal, 1 file, -0/+16]
Many PCI adapters can benefit from, or even require, full 64-bit read and write access to their registers. In order to enable work on userspace drivers for these devices, add two new variations, vfio_pci_core_io{read|write}64, of the existing access methods when the architecture supports 64-bit ioreads and iowrites. Signed-off-by: Ben Segal <bpsegal@us.ibm.com> Co-developed-by: Gerd Bayer <gbayer@linux.ibm.com> Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com> Link: https://lore.kernel.org/r/20240619115847.1344875-3-gbayer@linux.ibm.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
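The architecture gating is the interesting part; conceptually (a sketch, reusing the width macro from the companion patch below):

  /* Emit vfio_pci_core_io{read,write}64() only where the architecture
   * provides 64-bit MMIO primitives. */
  #if defined(ioread64) && defined(iowrite64)
  VFIO_IORDWR(64)
  #endif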
2024-06-21  vfio/pci: Extract duplicated code into macro  [Gerd Bayer, 1 file, -60/+46]
vfio_pci_core_do_io_rw() repeats the same code for multiple access widths. Factor this out into a macro. Suggested-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com> Link: https://lore.kernel.org/r/20240619115847.1344875-2-gbayer@linux.ibm.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
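The factored-out macro has roughly this shape (a condensed sketch of the idea, not the verbatim kernel macro):

  #define VFIO_IORDWR(size)                                                \
  static int vfio_pci_core_iordwr##size(struct vfio_pci_core_device *vdev,\
                                        bool iswrite, bool test_mem,      \
                                        void __iomem *io, char __user *buf,\
                                        loff_t off, size_t *filled)       \
  {                                                                       \
          u##size val;                                                    \
          int ret;                                                        \
                                                                          \
          if (iswrite) {                                                  \
                  if (copy_from_user(&val, buf, sizeof(val)))             \
                          return -EFAULT;                                 \
                  ret = vfio_pci_core_iowrite##size(vdev, test_mem, val,  \
                                                    io + off);            \
                  if (ret)                                                \
                          return ret;                                     \
          } else {                                                        \
                  ret = vfio_pci_core_ioread##size(vdev, test_mem, &val,  \
                                                   io + off);             \
                  if (ret)                                                \
                          return ret;                                     \
                  if (copy_to_user(buf, &val, sizeof(val)))               \
                          return -EFAULT;                                 \
          }                                                               \
          *filled = sizeof(val);                                          \
          return 0;                                                       \
  }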
2024-06-12  vfio/pci: Insert full vma on mmap'd MMIO fault  [Alex Williamson, 1 file, -2/+17]
In order to improve performance of typical scenarios we can try to insert the entire vma on fault. This accelerates typical cases, such as when the MMIO region is DMA mapped by QEMU. The vfio_iommu_type1 driver will fault in the entire DMA mapped range through fixup_user_fault(). In synthetic testing, this improves the time required to walk a PCI BAR mapping from userspace by roughly 1/3rd. This is likely an interim solution until vmf_insert_pfn_{pmd,pud}() gain support for pfnmaps. Suggested-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/all/Zl6XdUkt%2FzMMGOLF@yzhao56-desk.sh.intel.com/ Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20240607035213.2054226-1-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-05-31  vfio/pci: Use unmap_mapping_range()  [Alex Williamson, 1 file, -209/+55]
With the vfio device fd tied to the address space of the pseudo fs inode, we can use the mm to track all vmas that might be mmap'ing device BARs, which removes our vma_list and all the complicated lock ordering necessary to manually zap each related vma. Note that we can no longer store the pfn in vm_pgoff if we want to use unmap_mapping_range() to zap a selective portion of the device fd corresponding to BAR mappings. This also converts our mmap fault handler to use vmf_insert_pfn() because we no longer have a vma_list to avoid the concurrency problem with io_remap_pfn_range(). The goal is to eventually use the vm_ops huge_fault handler to avoid the additional faulting overhead, but vmf_insert_pfn_{pmd,pud}() need to learn about pfnmaps first. Also, Jason notes that a race exists between unmap_mapping_range() and the fops mmap callback if we were to call io_remap_pfn_range() to populate the vma on mmap. Specifically, mmap_region() does call_mmap() before it does vma_link_file() which gives a window where the vma is populated but invisible to unmap_mapping_range(). Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20240530045236.1005864-3-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
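The zap side collapses to a single call on the device fd's mapping (a sketch, assuming the per-device inode that this series ties the fd to):

  /* Zap every userspace mapping of the device fd in one shot; faults
   * repopulate via vmf_insert_pfn() once access is allowed again. */
  unmap_mapping_range(vdev->vdev.inode->i_mapping, 0, 0, true);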
2024-05-20  Merge tag 'vfio-v6.10-rc1' of https://github.com/awilliam/linux-vfio  [Linus Torvalds, 7 files, -63/+800]
Pull vfio updates from Alex Williamson:

 - The vfio fsl-mc bus driver has become orphaned. We'll consider removing it in future releases if a new maintainer isn't found (Alex Williamson)

 - Improved usage of opaque data in vfio-pci INTx handling, avoiding lookups of the eventfd through the interrupt and irqfd runtime paths (Alex Williamson)

 - Resolve an error path memory leak introduced in vfio-pci interrupt code (Ye Bin)

 - Addition of interrupt support for vfio devices exposed on the CDX bus, including a new MSI allocation helper and export of existing helpers for MSI alloc and free (Nipun Gupta)

 - A new vfio-pci variant driver supporting migration of Intel QAT VF devices for the GEN4 PFs (Xin Zeng & Yahui Cao)

 - Resolve a possibly circular locking dependency in vfio-pci by avoiding copy_to_user() from a PCI bus walk callback (Alex Williamson)

 - Trivial docs update to remove a duplicate semicolon (Foryun Ma)

* tag 'vfio-v6.10-rc1' of https://github.com/awilliam/linux-vfio:
  vfio/pci: Restore zero affected bus reset devices warning
  vfio: remove an extra semicolon
  vfio/pci: Collect hot-reset devices to local buffer
  vfio/qat: Add vfio_pci driver for Intel QAT SR-IOV VF devices
  vfio/cdx: add interrupt support
  genirq/msi: Add MSI allocation helper and export MSI functions
  vfio/pci: fix potential memory leak in vfio_intx_enable()
  vfio/pci: Pass eventfd context object through irqfd
  vfio/pci: Pass eventfd context to IRQ handler
  MAINTAINERS: Orphan vfio fsl-mc bus driver
2024-05-19  Merge tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm  [Linus Torvalds, 1 file, -0/+1]
Pull mm updates from Andrew Morton:
 "The usual shower of singleton fixes and minor series all over MM, documented (hopefully adequately) in the respective changelogs. Notable series include:

  - Lucas Stach has provided some page-mapping cleanup/consolidation/maintainability work in the series "mm/treewide: Remove pXd_huge() API".

  - In the series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one test.

  - In their series "Memory allocation profiling" Kent Overstreet and Suren Baghdasaryan have contributed a means of determining (via /proc/allocinfo) whereabouts in the kernel memory is being allocated: number of calls and amount of memory.

  - Matthew Wilcox has provided the series "Various significant MM patches" which does a number of rather unrelated things, but in largely similar code sites.

  - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes Weiner has fixed the page allocator's handling of migratetype requests, with resulting improvements in compaction efficiency.

  - In the series "make the hugetlb migration strategy consistent" Baolin Wang has fixed a hugetlb migration issue, which should improve hugetlb allocation reliability.

  - Liu Shixin has hit an I/O meltdown caused by readahead in a memory-tight memcg. Addressed in the series "Fix I/O high when memory almost met memcg limit".

  - In the series "mm/filemap: optimize folio adding and splitting" Kairui Song has optimized pagecache insertion, yielding ~10% performance improvement in one test.

  - Baoquan He has cleaned up and consolidated the early zone initialization code in the series "mm/mm_init.c: refactor free_area_init_core()".

  - Baoquan has also redone some MM initialization code in the series "mm/init: minor clean up and improvement".

  - MM helper cleanups from Christoph Hellwig in his series "remove follow_pfn".

  - More cleanups from Matthew Wilcox in the series "Various page->flags cleanups".

  - Vlastimil Babka has contributed maintainability improvements in the series "memcg_kmem hooks refactoring".

  - More folio conversions and cleanups in Matthew Wilcox's series: "Convert huge_zero_page to huge_zero_folio", "khugepaged folio conversions", "Remove page_idle and page_young wrappers", "Use folio APIs in procfs", "Clean up __folio_put()", "Some cleanups for memory-failure", "Remove page_mapping()", "More folio compat code removal".

  - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb functions to work on folios".

  - Code consolidation and cleanup work related to GUP's handling of hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2".

  - Rick Edgecombe has developed some fixes to stack guard gaps in the series "Cover a guard gap corner case".

  - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series "mm/ksm: fix ksm exec support for prctl".

  - Baolin Wang has implemented NUMA balancing for multi-size THPs. This is a simple first-cut implementation for now. The series is "support multi-size THP numa balancing".

  - Cleanups to vma handling helper functions from Matthew Wilcox in the series "Unify vma_address and vma_pgoff_address".

  - Some selftests maintenance work from Dev Jain in the series "selftests/mm: mremap_test: Optimizations and style fixes".

  - Improvements to the swapping of multi-size THPs from Ryan Roberts in the series "Swap-out mTHP without splitting".

  - Kefeng Wang has significantly optimized the handling of arm64's permission page faults in the series "arch/mm/fault: accelerate pagefault when badaccess" and "mm: remove arch's private VM_FAULT_BADMAP/BADACCESS".

  - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it GUP-fast".

  - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to use struct vm_fault".

  - selftests build fixes from John Hubbard in the series "Fix selftests/mm build without requiring "make headers"".

  - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the series "Improved Memory Tier Creation for CPUless NUMA Nodes". Fixes the initialization code so that migration between different memory types works as intended.

  - David Hildenbrand has improved follow_pte() and fixed an errant driver in the series "mm: follow_pte() improvements and acrn follow_pte() fixes".

  - David also did some cleanup work on large folio mapcounts in his series "mm: mapcount for large folios + page_mapcount() cleanups".

  - Folio conversions in KSM in Alex Shi's series "transfer page to folio in KSM".

  - Barry Song has added some sysfs stats for monitoring multi-size THP's in the series "mm: add per-order mTHP alloc and swpout counters".

  - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled and limit checking cleanups".

  - Matthew Wilcox has been looking at buffer_head code and found the documentation to be lacking. The series is "Improve buffer head documentation".

  - Multi-size THPs get more work, this time from Lance Yang. His series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes the freeing of these things.

  - Kemeng Shi has added more userspace-visible writeback instrumentation in the series "Improve visibility of writeback".

  - Kemeng Shi then sent some maintenance work on top in the series "Fix and cleanups to page-writeback".

  - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the series "Improve anon_vma scalability for anon VMAs". Intel's test bot reported an improbable 3x improvement in one test.

  - SeongJae Park adds some DAMON feature work in the series "mm/damon: add a DAMOS filter type for page granularity access recheck" and "selftests/damon: add DAMOS quota goal test".

  - Also some maintenance work in the series "mm/damon/paddr: simplify page level access re-check for pageout" and "mm/damon: misc fixes and improvements".

  - David Hildenbrand has disabled some known-to-fail selftests in the series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL".

  - memcg metadata storage optimizations from Shakeel Butt in "memcg: reduce memory consumption by memcg stats".

  - DAX fixes and maintenance work from Vishal Verma in the series "dax/bus.c: Fixups for dax-bus locking""

* tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits)
  memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order
  selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime
  mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp
  mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault
  selftests: cgroup: add tests to verify the zswap writeback path
  mm: memcg: make alloc_mem_cgroup_per_node_info() return bool
  mm/damon/core: fix return value from damos_wmark_metric_value
  mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED
  selftests: cgroup: remove redundant enabling of memory controller
  Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree
  Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT
  Docs/mm/damon/design: use a list for supported filters
  Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command
  Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file
  selftests/damon: classify tests for functionalities and regressions
  selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None'
  selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts
  selftests/damon/_damon_sysfs: check errors from nr_schemes file reads
  mm/damon/core: initialize ->esz_bp from damos_quota_init_priv()
  selftests/damon: add a test for DAMOS quota goal
  ...
2024-05-17  vfio/pci: Restore zero affected bus reset devices warning  [Alex Williamson, 1 file, -0/+3]
Yi notes relative to commit f6944d4a0b87 ("vfio/pci: Collect hot-reset devices to local buffer") that we previously tested the resulting device count with a WARN_ON, which was removed when we switched to the in-loop user copy in commit b56b7aabcf3c ("vfio/pci: Copy hot-reset device info to userspace in the devices loop"). Finding no devices in the bus/slot would be an unexpected condition, so let's restore the warning and trigger a -ERANGE error here as success with no devices would be an unexpected result to userspace as well. Suggested-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Cédric Le Goater <clg@redhat.com> Reviewed-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20240516174831.2257970-1-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-05-13  VFIO: Add the SPR_DSA and SPR_IAX devices to the denylist  [Arjan van de Ven, 1 file, -0/+2]
Due to an erratum with the SPR_DSA and SPR_IAX devices, it is not secure to assign these devices to virtual machines. Add the PCI IDs of these devices to the VFIO denylist to ensure that this is handled appropriately by the VFIO subsystem. The SPR_DSA and SPR_IAX devices are on-SOC devices for the Sapphire Rapids (and related) family of products that perform data movement and compression. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
2024-05-09  vfio/pci: Collect hot-reset devices to local buffer  [Alex Williamson, 1 file, -29/+49]
Lockdep reports the below circular locking dependency issue. The mmap_lock acquisition while holding pci_bus_sem is due to the use of copy_to_user() from within a pci_walk_bus() callback. Building the devices array directly into the user buffer is only for convenience. Instead we can allocate a local buffer for the array, bounded by the number of devices on the bus/slot, fill the device information into this local buffer, then copy it into the user buffer outside the bus walk callback.

  ======================================================
  WARNING: possible circular locking dependency detected
  6.9.0-rc5+ #39 Not tainted
  ------------------------------------------------------
  CPU 0/KVM/4113 is trying to acquire lock:
  ffff99a609ee18a8 (&vdev->vma_lock){+.+.}-{4:4}, at: vfio_pci_mmap_fault+0x35/0x1a0 [vfio_pci_core]

  but task is already holding lock:
  ffff99a243a052a0 (&mm->mmap_lock){++++}-{4:4}, at: vaddr_get_pfns+0x3f/0x170 [vfio_iommu_type1]

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #3 (&mm->mmap_lock){++++}-{4:4}:
         __lock_acquire+0x4e4/0xb90
         lock_acquire+0xbc/0x2d0
         __might_fault+0x5c/0x80
         _copy_to_user+0x1e/0x60
         vfio_pci_fill_devs+0x9f/0x130 [vfio_pci_core]
         vfio_pci_walk_wrapper+0x45/0x60 [vfio_pci_core]
         __pci_walk_bus+0x6b/0xb0
         vfio_pci_ioctl_get_pci_hot_reset_info+0x10b/0x1d0 [vfio_pci_core]
         vfio_pci_core_ioctl+0x1cb/0x400 [vfio_pci_core]
         vfio_device_fops_unl_ioctl+0x7e/0x140 [vfio]
         __x64_sys_ioctl+0x8a/0xc0
         do_syscall_64+0x8d/0x170
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

  -> #2 (pci_bus_sem){++++}-{4:4}:
         __lock_acquire+0x4e4/0xb90
         lock_acquire+0xbc/0x2d0
         down_read+0x3e/0x160
         pci_bridge_wait_for_secondary_bus.part.0+0x33/0x2d0
         pci_reset_bus+0xdd/0x160
         vfio_pci_dev_set_hot_reset+0x256/0x270 [vfio_pci_core]
         vfio_pci_ioctl_pci_hot_reset_groups+0x1a3/0x280 [vfio_pci_core]
         vfio_pci_core_ioctl+0x3b5/0x400 [vfio_pci_core]
         vfio_device_fops_unl_ioctl+0x7e/0x140 [vfio]
         __x64_sys_ioctl+0x8a/0xc0
         do_syscall_64+0x8d/0x170
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

  -> #1 (&vdev->memory_lock){+.+.}-{4:4}:
         __lock_acquire+0x4e4/0xb90
         lock_acquire+0xbc/0x2d0
         down_write+0x3b/0xc0
         vfio_pci_zap_and_down_write_memory_lock+0x1c/0x30 [vfio_pci_core]
         vfio_basic_config_write+0x281/0x340 [vfio_pci_core]
         vfio_config_do_rw+0x1fa/0x300 [vfio_pci_core]
         vfio_pci_config_rw+0x75/0xe50 [vfio_pci_core]
         vfio_pci_rw+0xea/0x1a0 [vfio_pci_core]
         vfs_write+0xea/0x520
         __x64_sys_pwrite64+0x90/0xc0
         do_syscall_64+0x8d/0x170
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

  -> #0 (&vdev->vma_lock){+.+.}-{4:4}:
         check_prev_add+0xeb/0xcc0
         validate_chain+0x465/0x530
         __lock_acquire+0x4e4/0xb90
         lock_acquire+0xbc/0x2d0
         __mutex_lock+0x97/0xde0
         vfio_pci_mmap_fault+0x35/0x1a0 [vfio_pci_core]
         __do_fault+0x31/0x160
         do_pte_missing+0x65/0x3b0
         __handle_mm_fault+0x303/0x720
         handle_mm_fault+0x10f/0x460
         fixup_user_fault+0x7f/0x1f0
         follow_fault_pfn+0x66/0x1c0 [vfio_iommu_type1]
         vaddr_get_pfns+0xf2/0x170 [vfio_iommu_type1]
         vfio_pin_pages_remote+0x348/0x4e0 [vfio_iommu_type1]
         vfio_pin_map_dma+0xd2/0x330 [vfio_iommu_type1]
         vfio_dma_do_map+0x2c0/0x440 [vfio_iommu_type1]
         vfio_iommu_type1_ioctl+0xc5/0x1d0 [vfio_iommu_type1]
         __x64_sys_ioctl+0x8a/0xc0
         do_syscall_64+0x8d/0x170
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

  other info that might help us debug this:

  Chain exists of:
    &vdev->vma_lock --> pci_bus_sem --> &mm->mmap_lock

  Possible unsafe locking scenario:

  block dm-0: the capability attribute has been deprecated.

         CPU0                    CPU1
         ----                    ----
    rlock(&mm->mmap_lock);
                                 lock(pci_bus_sem);
                                 lock(&mm->mmap_lock);
    lock(&vdev->vma_lock);

   *** DEADLOCK ***

  2 locks held by CPU 0/KVM/4113:
   #0: ffff99a25f294888 (&iommu->lock#2){+.+.}-{4:4}, at: vfio_dma_do_map+0x60/0x440 [vfio_iommu_type1]
   #1: ffff99a243a052a0 (&mm->mmap_lock){++++}-{4:4}, at: vaddr_get_pfns+0x3f/0x170 [vfio_iommu_type1]

  stack backtrace:
  CPU: 1 PID: 4113 Comm: CPU 0/KVM Not tainted 6.9.0-rc5+ #39
  Hardware name: Dell Inc. PowerEdge T640/04WYPY, BIOS 2.15.1 06/16/2022
  Call Trace:
   <TASK>
   dump_stack_lvl+0x64/0xa0
   check_noncircular+0x131/0x150
   check_prev_add+0xeb/0xcc0
   ? add_chain_cache+0x10a/0x2f0
   ? __lock_acquire+0x4e4/0xb90
   validate_chain+0x465/0x530
   __lock_acquire+0x4e4/0xb90
   lock_acquire+0xbc/0x2d0
   ? vfio_pci_mmap_fault+0x35/0x1a0 [vfio_pci_core]
   ? lock_is_held_type+0x9a/0x110
   __mutex_lock+0x97/0xde0
   ? vfio_pci_mmap_fault+0x35/0x1a0 [vfio_pci_core]
   ? lock_acquire+0xbc/0x2d0
   ? vfio_pci_mmap_fault+0x35/0x1a0 [vfio_pci_core]
   ? find_held_lock+0x2b/0x80
   ? vfio_pci_mmap_fault+0x35/0x1a0 [vfio_pci_core]
   vfio_pci_mmap_fault+0x35/0x1a0 [vfio_pci_core]
   __do_fault+0x31/0x160
   do_pte_missing+0x65/0x3b0
   __handle_mm_fault+0x303/0x720
   handle_mm_fault+0x10f/0x460
   fixup_user_fault+0x7f/0x1f0
   follow_fault_pfn+0x66/0x1c0 [vfio_iommu_type1]
   vaddr_get_pfns+0xf2/0x170 [vfio_iommu_type1]
   vfio_pin_pages_remote+0x348/0x4e0 [vfio_iommu_type1]
   vfio_pin_map_dma+0xd2/0x330 [vfio_iommu_type1]
   vfio_dma_do_map+0x2c0/0x440 [vfio_iommu_type1]
   vfio_iommu_type1_ioctl+0xc5/0x1d0 [vfio_iommu_type1]
   __x64_sys_ioctl+0x8a/0xc0
   do_syscall_64+0x8d/0x170
   ? rcu_core+0x8d/0x250
   ? __lock_release+0x5e/0x160
   ? rcu_core+0x8d/0x250
   ? lock_release+0x5f/0x120
   ? sched_clock+0xc/0x30
   ? sched_clock_cpu+0xb/0x190
   ? irqtime_account_irq+0x40/0xc0
   ? __local_bh_enable+0x54/0x60
   ? __do_softirq+0x315/0x3ca
   ? lockdep_hardirqs_on_prepare.part.0+0x97/0x140
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7f8300d0357b
  Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 68 0f 00 f7 d8 64 89 01 48
  RSP: 002b:00007f82ef3fb948 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
  RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f8300d0357b
  RDX: 00007f82ef3fb990 RSI: 0000000000003b71 RDI: 0000000000000023
  RBP: 00007f82ef3fb9c0 R08: 0000000000000000 R09: 0000561b7e0bcac2
  R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
  R13: 0000000200000000 R14: 0000381800000000 R15: 0000000000000000
   </TASK>

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20240503143138.3562116-1-alex.williamson@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
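The reordering reduces to: build the array in kernel memory under pci_bus_sem, copy to userspace after the walk (a sketch; the 'fill' state struct is hypothetical):

  struct vfio_pci_dependent_device *devices;

  devices = kcalloc(count, sizeof(*devices), GFP_KERNEL); /* bounded by bus/slot device count */
  if (!devices)
          return -ENOMEM;

  /* The callback fills devices[] only; no copy_to_user() runs under
   * pci_bus_sem, so mmap_lock is never taken inside the bus walk. */
  pci_walk_bus(bus, vfio_pci_fill_devs, &fill);

  if (copy_to_user(arg->devices, devices, count * sizeof(*devices)))
          ret = -EFAULT;
  kfree(devices);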
2024-04-29  vfio/qat: Add vfio_pci driver for Intel QAT SR-IOV VF devices  [Xin Zeng, 5 files, -0/+721]
Add vfio pci variant driver for Intel QAT SR-IOV VF devices. This driver registers to the vfio subsystem through the interfaces exposed by the subsystem. It follows the live migration protocol v2 defined in uapi/linux/vfio.h and interacts with Intel QAT PF driver through a set of interfaces defined in qat/qat_mig_dev.h to support live migration of Intel QAT VF devices. This version only covers migration for Intel QAT GEN4 VF devices. Co-developed-by: Yahui Cao <yahui.cao@intel.com> Signed-off-by: Yahui Cao <yahui.cao@intel.com> Signed-off-by: Xin Zeng <xin.zeng@intel.com> Reviewed-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20240426064051.2859652-1-xin.zeng@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-04-25  fix missing vmalloc.h includes  [Kent Overstreet, 1 file, -0/+1]
Patch series "Memory allocation profiling", v6. Overview: Low overhead [1] per-callsite memory allocation profiling. Not just for debug kernels, overhead low enough to be deployed in production. Example output: root@moria-kvm:~# sort -rn /proc/allocinfo 127664128 31168 mm/page_ext.c:270 func:alloc_page_ext 56373248 4737 mm/slub.c:2259 func:alloc_slab_page 14880768 3633 mm/readahead.c:247 func:page_cache_ra_unbounded 14417920 3520 mm/mm_init.c:2530 func:alloc_large_system_hash 13377536 234 block/blk-mq.c:3421 func:blk_mq_alloc_rqs 11718656 2861 mm/filemap.c:1919 func:__filemap_get_folio 9192960 2800 kernel/fork.c:307 func:alloc_thread_stack_node 4206592 4 net/netfilter/nf_conntrack_core.c:2567 func:nf_ct_alloc_hashtable 4136960 1010 drivers/staging/ctagmod/ctagmod.c:20 [ctagmod] func:ctagmod_start 3940352 962 mm/memory.c:4214 func:alloc_anon_folio 2894464 22613 fs/kernfs/dir.c:615 func:__kernfs_new_node ... Usage: kconfig options: - CONFIG_MEM_ALLOC_PROFILING - CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT - CONFIG_MEM_ALLOC_PROFILING_DEBUG adds warnings for allocations that weren't accounted because of a missing annotation sysctl: /proc/sys/vm/mem_profiling Runtime info: /proc/allocinfo Notes: [1]: Overhead To measure the overhead we are comparing the following configurations: (1) Baseline with CONFIG_MEMCG_KMEM=n (2) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n) (3) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) (4) Enabled at runtime (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n && /proc/sys/vm/mem_profiling=1) (5) Baseline with CONFIG_MEMCG_KMEM=y && allocating with __GFP_ACCOUNT (6) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n) && CONFIG_MEMCG_KMEM=y (7) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) && CONFIG_MEMCG_KMEM=y Performance overhead: To evaluate performance we implemented an in-kernel test executing multiple get_free_page/free_page and kmalloc/kfree calls with allocation sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU affinity set to a specific CPU to minimize the noise. Below are results from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on 56 core Intel Xeon: kmalloc pgalloc (1 baseline) 6.764s 16.902s (2 default disabled) 6.793s (+0.43%) 17.007s (+0.62%) (3 default enabled) 7.197s (+6.40%) 23.666s (+40.02%) (4 runtime enabled) 7.405s (+9.48%) 23.901s (+41.41%) (5 memcg) 13.388s (+97.94%) 48.460s (+186.71%) (6 def disabled+memcg) 13.332s (+97.10%) 48.105s (+184.61%) (7 def enabled+memcg) 13.446s (+98.78%) 54.963s (+225.18%) Memory overhead: Kernel size: text data bss dec diff (1) 26515311 18890222 17018880 62424413 (2) 26524728 19423818 16740352 62688898 264485 (3) 26524724 19423818 16740352 62688894 264481 (4) 26524728 19423818 16740352 62688898 264485 (5) 26541782 18964374 16957440 62463596 39183 Memory consumption on a 56 core Intel CPU with 125GB of memory: Code tags: 192 kB PageExts: 262144 kB (256MB) SlabExts: 9876 kB (9.6MB) PcpuExts: 512 kB (0.5MB) Total overhead is 0.2% of total memory. 
Benchmarks: Hackbench tests run 100 times: hackbench -s 512 -l 200 -g 15 -f 25 -P baseline disabled profiling enabled profiling avg 0.3543 0.3559 (+0.0016) 0.3566 (+0.0023) stdev 0.0137 0.0188 0.0077 hackbench -l 10000 baseline disabled profiling enabled profiling avg 6.4218 6.4306 (+0.0088) 6.5077 (+0.0859) stdev 0.0933 0.0286 0.0489 stress-ng tests: stress-ng --class memory --seq 4 -t 60 stress-ng --class cpu --seq 4 -t 60 Results posted at: https://evilpiepirate.org/~kent/memalloc_prof_v4_stress-ng/ [2] https://lore.kernel.org/all/20240306182440.2003814-1-surenb@google.com/ This patch (of 37): The next patch drops vmalloc.h from a system header in order to fix a circular dependency; this adds it to all the files that were pulling it in implicitly. [kent.overstreet@linux.dev: fix arch/alpha/lib/memcpy.c] Link: https://lkml.kernel.org/r/20240327002152.3339937-1-kent.overstreet@linux.dev [surenb@google.com: fix arch/x86/mm/numa_32.c] Link: https://lkml.kernel.org/r/20240402180933.1663992-1-surenb@google.com [kent.overstreet@linux.dev: a few places were depending on sizes.h] Link: https://lkml.kernel.org/r/20240404034744.1664840-1-kent.overstreet@linux.dev [arnd@arndb.de: fix mm/kasan/hw_tags.c] Link: https://lkml.kernel.org/r/20240404124435.3121534-1-arnd@kernel.org [surenb@google.com: fix arc build] Link: https://lkml.kernel.org/r/20240405225115.431056-1-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-1-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-2-surenb@google.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-22  vfio/pci: fix potential memory leak in vfio_intx_enable()  [Ye Bin, 1 file, -1/+3]
If vfio_irq_ctx_alloc() fails, the previously allocated 'name' is leaked. Fixes: 18c198c96a81 ("vfio/pci: Create persistent INTx handler") Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/r/20240415015029.3699844-1-yebin10@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
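The fix is the classic error-path free (shape of the change, per the description above):

  name = kasprintf(GFP_KERNEL, "vfio-intx(%s)", pci_name(pdev));
  if (!name)
          return -ENOMEM;

  ctx = vfio_irq_ctx_alloc(vdev, 0);
  if (!ctx) {
          kfree(name);    /* previously leaked on this path */
          return -ENOMEM;
  }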
2024-04-22  vfio/pci: Pass eventfd context object through irqfd  [Alex Williamson, 1 file, -20/+13]
Further avoid lookup of the context object by passing it through the irqfd data field. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20240401195406.3720453-3-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-04-22  vfio/pci: Pass eventfd context to IRQ handler  [Alex Williamson, 1 file, -13/+11]
Create a link back to the vfio_pci_core_device on the eventfd context object to avoid lookups in the interrupt path. The context is known valid in the interrupt handler. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20240401195406.3720453-2-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-03-11  vfio/pci: Create persistent INTx handler  [Alex Williamson, 1 file, -67/+78]
A vulnerability exists where the eventfd for INTx signaling can be deconfigured, which unregisters the IRQ handler but still allows eventfds to be signaled with a NULL context through the SET_IRQS ioctl or through unmask irqfd if the device interrupt is pending. Ideally this could be solved with some additional locking; the igate mutex serializes the ioctl and config space accesses, and the interrupt handler is unregistered relative to the trigger, but the irqfd path runs asynchronous to those. The igate mutex cannot be acquired from the atomic context of the eventfd wake function. Disabling the irqfd relative to the eventfd registration is potentially incompatible with existing userspace. As a result, the solution implemented here moves configuration of the INTx interrupt handler to track the lifetime of the INTx context object and irq_type configuration, rather than registration of a particular trigger eventfd. Synchronization is added between the ioctl path and eventfd_signal() wrapper such that the eventfd trigger can be dynamically updated relative to in-flight interrupts or irqfd callbacks. Cc: <stable@vger.kernel.org> Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver") Reported-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Link: https://lore.kernel.org/r/20240308230557.805580-5-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-03-11  vfio/pci: Lock external INTx masking ops  [Alex Williamson, 1 file, -6/+28]
Mask operations through config space changes to DisINTx may race INTx configuration changes via ioctl. Create wrappers that add locking for paths outside of the core interrupt code. In particular, irq_type is updated holding igate, therefore testing is_intx() requires holding igate. For example clearing DisINTx from config space can otherwise race changes of the interrupt configuration. This aligns interfaces which may trigger the INTx eventfd into two camps, one side serialized by igate and the other only enabled while INTx is configured. A subsequent patch introduces synchronization for the latter flows. Cc: <stable@vger.kernel.org> Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver") Reported-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Link: https://lore.kernel.org/r/20240308230557.805580-3-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-03-11  vfio/pci: Disable auto-enable of exclusive INTx IRQ  [Alex Williamson, 1 file, -7/+10]
Currently for devices requiring masking at the irqchip for INTx, i.e. devices without DisINTx support, the IRQ is enabled in request_irq() and subsequently disabled as necessary to align with the masked status flag. This presents a window where the interrupt could fire between these events, resulting in the IRQ incrementing the disable depth twice. This would be unrecoverable for a user since the masked flag prevents nested enables through vfio. Instead, invert the logic using IRQF_NO_AUTOEN such that exclusive INTx is never auto-enabled, then unmask as required. Cc: <stable@vger.kernel.org> Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver") Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Link: https://lore.kernel.org/r/20240308230557.805580-2-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
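The inverted logic, sketched (simplified; close to the flow described above but not the verbatim patch):

  /* Never let request_irq() auto-enable an exclusive INTx line. */
  unsigned long irqflags = vdev->pci_2_3 ? IRQF_SHARED : IRQF_NO_AUTOEN;

  ret = request_irq(pdev->irq, vfio_intx_handler, irqflags, ctx->name, ctx);
  if (ret)
          return ret;

  /* Unmask (enable) only once state is consistent, closing the window
   * where a stray interrupt could bump the disable depth twice. */
  if (!vdev->pci_2_3 && !ctx->masked)
          enable_irq(pdev->irq);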
2024-03-11  vfio/pds: Refactor/simplify reset logic  [Brett Creeley, 4 files, -67/+19]
The current logic for handling resets is more complicated than it needs to be. The deferred_reset flag is used to indicate a reset is needed and the deferred_reset_state is the requested, post-reset, state. Also, the deferred_reset logic was added to vfio migration drivers to prevent a circular locking dependency with respect to mm_lock and the state mutex. This is mainly because of the copy_to/from_user() functions (which take mm_lock) invoked under the state mutex. Remove all of the deferred reset logic and just pass the requested next state to pds_vfio_reset() so it can be used for VMM and DSC initiated resets. This removes the need for pds_vfio_state_mutex_unlock(), so remove that and replace its use with a simple mutex_unlock(). Also, remove the reset_mutex as it's no longer needed since the state_mutex can be the driver's primary protector. Suggested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Signed-off-by: Brett Creeley <brett.creeley@amd.com> Link: https://lore.kernel.org/r/20240308182149.22036-3-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-03-11  vfio/pds: Make sure migration file isn't accessed after reset  [Brett Creeley, 2 files, -0/+14]
It's possible the migration file is accessed after reset when it has been cleaned up, especially when it's initiated by the device. This is because the driver doesn't rip out the filep when cleaning up; it only frees the related page structures and sets its local struct pds_vfio_lm_file pointer to NULL. This can cause a NULL pointer dereference, which is shown in the example below during a restore after a device initiated reset:

  BUG: kernel NULL pointer dereference, address: 000000000000000c
  PF: supervisor read access in kernel mode
  PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: 0000 [#1] PREEMPT SMP NOPTI
  RIP: 0010:pds_vfio_get_file_page+0x5d/0xf0 [pds_vfio_pci]
  [...]
  Call Trace:
   <TASK>
   pds_vfio_restore_write+0xf6/0x160 [pds_vfio_pci]
   vfs_write+0xc9/0x3f0
   ? __fget_light+0xc9/0x110
   ksys_write+0xb5/0xf0
   __x64_sys_write+0x1a/0x20
   do_syscall_64+0x38/0x90
   entry_SYSCALL_64_after_hwframe+0x63/0xcd
  [...]

Add a disabled flag to the driver's struct pds_vfio_lm_file that gets set during cleanup. Then make sure to check the flag when the migration file is accessed via its file_operations. By default this flag will be false as the memory for struct pds_vfio_lm_file is kzalloc'd, which means the struct pds_vfio_lm_file is enabled and accessible. Also, since the file_operations and driver's migration file cleanup happen under the protection of the same pds_vfio_lm_file.lock, using this flag is thread safe.

Fixes: 8512ed256334 ("vfio/pds: Always clear the save/restore FDs on reset")
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Link: https://lore.kernel.org/r/20240308182149.22036-2-brett.creeley@amd.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
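Every file_operations entry point then checks the flag under the same lock the reset cleanup takes (a sketch of the pattern):

  mutex_lock(&lm_file->lock);
  if (lm_file->disabled) {        /* set by the reset cleanup path */
          mutex_unlock(&lm_file->lock);
          return -ENODEV;
  }
  /* ... normal read/write against lm_file's pages ... */
  mutex_unlock(&lm_file->lock);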
2024-03-11  vfio/mlx5: Enforce PRE_COPY support  [Yishai Hadas, 3 files, -127/+71]
Enable live migration only once the firmware supports PRE_COPY. PRE_COPY has been supported by the firmware for a long time already [1] and is required to achieve a low downtime upon live migration. This lets us clean up some old code that is no longer applicable now that PRE_COPY is fully supported by the firmware. [1] The minimum firmware version that supports PRE_COPY is 28.36.1010, released in January 2023. No firmware without PRE_COPY support was ever made available to users. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20240306105624.114830-1-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-03-05hisi_acc_vfio_pci: Remove the deferred_reset logicShameer Kolothum2-40/+14
The deferred_reset logic was added to vfio migration drivers to prevent a circular locking dependency between mm_lock and the state mutex, mainly because of the copy_to/from_user() functions (which take mm_lock) invoked under the state mutex. But for the HiSilicon driver, the only place where we now hold the state mutex for copy_to_user() is during the PRE_COPY IOCTL. So for PRE_COPY, release the lock as soon as we have updated the data and perform the copy_to_user() without the state mutex held. This lets us get rid of the deferred_reset logic.

Link: https://lore.kernel.org/kvm/20240220132459.GM13330@nvidia.com/
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20240229091152.56664-1-shameerali.kolothum.thodi@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
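[Editor's note] The resulting PRE_COPY pattern looks roughly like this. Hedged sketch; hisi_acc_vf_update_precopy_info() is a hypothetical stand-in for the driver's data-update step:

static long hisi_acc_vf_precopy_ioctl(struct hisi_acc_vf_core_device *hisi_acc_vdev,
                                      struct vfio_precopy_info __user *arg)
{
        struct vfio_precopy_info info = {};
        int ret;

        mutex_lock(&hisi_acc_vdev->state_mutex);
        ret = hisi_acc_vf_update_precopy_info(hisi_acc_vdev, &info);
        mutex_unlock(&hisi_acc_vdev->state_mutex);
        if (ret)
                return ret;

        /* mm_lock is taken here, but the state mutex is not held */
        return copy_to_user(arg, &info, sizeof(info)) ? -EFAULT : 0;
}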
2024-03-04vfio/nvgrace-gpu: Convey kvm to map device memory region as noncachedAnkit Agrawal1-1/+10
The NVIDIA Grace Hopper GPUs have device memory that is supposed to be used as regular RAM. It is accessible through the CPU-GPU chip-to-chip cache coherent interconnect and is present in the system physical address space. The device memory is split into two regions - termed usemem and resmem - in the system physical address space, with each region mapped and exposed to the VM as a separate fake device BAR [1].

Owing to a hardware defect in the Multi-Instance GPU (MIG) feature [2], there is a requirement - as a workaround - for the resmem BAR to display uncached memory characteristics. Based on [3], on systems with FWB enabled such as Grace Hopper, the requisite properties (uncached, unaligned access) can be achieved through a VM mapping (S1) of NORMAL_NC and a host mapping (S2) of MT_S2_FWB_NORMAL_NC.

KVM currently maps the MMIO region in S2 as MT_S2_FWB_DEVICE_nGnRE by default. The fake device BARs thus display DEVICE_nGnRE behavior in the VM. The following table summarizes the behavior for the various S1 and S2 mapping combinations for systems with FWB enabled [3]:

S1        | S2           | Result
NORMAL_NC | NORMAL_NC    | NORMAL_NC
NORMAL_NC | DEVICE_nGnRE | DEVICE_nGnRE

Recently a change was added that modifies this default behavior and makes KVM map MMIO as MT_S2_FWB_NORMAL_NC when the VMA flag VM_ALLOW_ANY_UNCACHED is set [4]. Setting S2 to MT_S2_FWB_NORMAL_NC provides the desired behavior (uncached, unaligned access) for resmem.

To use the VM_ALLOW_ANY_UNCACHED flag, the platform must guarantee that no action taken on the MMIO mapping can trigger an uncontained failure. Grace Hopper satisfies this requirement, so set the VM_ALLOW_ANY_UNCACHED flag in the VMA.

Applied over next-20240227.
base-commit: 22ba90670a51

Link: https://lore.kernel.org/all/20240220115055.23546-4-ankita@nvidia.com/ [1]
Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [2]
Link: https://developer.arm.com/documentation/ddi0487/latest/ section D8.5.5 [3]
Link: https://lore.kernel.org/all/20240224150546.368-1-ankita@nvidia.com/ [4]
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Vikram Sethi <vsethi@nvidia.com>
Cc: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20240229193934.2417-1-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-03-04Merge branch 'kvm-arm64/vfio-normal-nc' of https://git.kernel.org/pub/scm/linux/kernel/git/oupton/linux into v6.9/vfio/nextAlex Williamson1-1/+18
2024-03-01vfio/pds: Always clear the save/restore FDs on resetBrett Creeley1-2/+2
After reset the VFIO device state will always be put in VFIO_DEVICE_STATE_RUNNING, but the save/restore files will only be cleared if the previous state was VFIO_DEVICE_STATE_ERROR. This causes the restore/save files to be leaked when the migration state machine transitions through the states that re-allocate these files. Fix this by always clearing the restore/save files on reset.

Fixes: 7dabb1bcd177 ("vfio/pds: Add support for firmware recovery")
Cc: stable@vger.kernel.org
Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240228003205.47311-2-brett.creeley@amd.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
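[Editor's note] In effect, the reset path changes from a conditional to an unconditional clear. A hedged before/after sketch based only on the description above:

        /* Before: files leaked unless the prior state was ERROR */
        if (pds_vfio->state == VFIO_DEVICE_STATE_ERROR) {
                pds_vfio_put_restore_file(pds_vfio);
                pds_vfio_put_save_file(pds_vfio);
        }

        /* After: always clear the save/restore files on reset */
        pds_vfio_put_restore_file(pds_vfio);
        pds_vfio_put_save_file(pds_vfio);
        pds_vfio->state = VFIO_DEVICE_STATE_RUNNING;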
2024-02-24vfio: Convey kvm that the vfio-pci device is wc safeAnkit Agrawal1-1/+18
The VM_ALLOW_ANY_UNCACHED flag is implemented for ARM64, allowing KVM stage 2 device mapping attributes to use Normal-NC rather than DEVICE_nGnRE, which allows guest mappings supporting write-combining (WC) attributes. ARM does not architecturally guarantee this is safe, and indeed some MMIO regions like the GICv2 VCPU interface can trigger uncontained faults if Normal-NC is used.

To safely use VFIO in KVM, the platform must guarantee full safety in the guest, where no action taken against an MMIO mapping can trigger an uncontained failure. The expectation is that most VFIO PCI platforms support this for both mapping types, at least in common flows, based on how PCI IP is typically integrated. So make vfio-pci set the VM_ALLOW_ANY_UNCACHED flag.

Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20240224150546.368-5-ankita@nvidia.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
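[Editor's note] A sketch of where the flag lands, assuming it is applied in the vfio-pci core mmap path next to the usual MMIO VMA flags; the surrounding code is approximate:

int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma)
{
        /* ... BAR validation and pfn setup ... */
        vm_flags_set(vma, VM_ALLOW_ANY_UNCACHED | VM_IO | VM_PFNMAP |
                          VM_DONTEXPAND | VM_DONTDUMP);
        vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
        /* ... fault handler wiring ... */
        return 0;
}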
2024-02-22vfio/nvgrace-gpu: Add vfio pci variant module for grace hopperAnkit Agrawal5-0/+896
NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device for the on-chip GPU that is the logical OS representation of the internal proprietary chip-to-chip cache coherent interconnect.

The device is peculiar compared to a real PCI device in that whilst there is a real 64b PCI BAR1 (comprising region 2 and region 3) on the device, it is not used to access device memory once the faster chip-to-chip interconnect is initialized (which occurs at host system boot). The device memory is instead accessed via the chip-to-chip interconnect, which is exposed as a contiguous physically addressable region on the host. This device memory aperture can be obtained from the host ACPI table using device_property_read_u64(), according to the FW specification. Since the device memory is cache coherent with the CPU, it can be mmapped into the user VMA with a cacheable mapping using remap_pfn_range() and used like regular RAM. The device memory is not added to the host kernel but mapped directly, as this reduces memory wastage due to struct pages.

There is also a requirement for a minimum reserved 1G uncached region (termed resmem) to support the Multi-Instance GPU (MIG) feature [1]. This is to work around a HW defect. Based on [2], the requisite properties (uncached, unaligned access) can be achieved through a VM mapping (S1) of NORMAL_NC and a host (S2) mapping with MemAttr[2:0]=0b101. To provide a different non-cached property for the reserved 1G region, it needs to be carved out from the device memory and mapped as a separate region in the Qemu VMA with pgprot_writecombine(). pgprot_writecombine() sets the Qemu VMA page properties (pgprot) to NORMAL_NC.

Provide a VFIO PCI variant driver that adapts the unique device memory representation into a more standard PCI representation facing userspace. The variant driver exposes these two regions - the non-cached reserved region (resmem) and the cached rest of the device memory (termed usemem) - as separate VFIO 64b BAR regions. This diverges from the baremetal approach, where the device memory is exposed as a device memory region. The decision to take a different approach was made because the baremetal approach would necessitate additional code in Qemu to discover and insert those regions into the VM IPA, along with additional VM ACPI DSDT changes to communicate the device memory region IPA to the VM workloads. Moreover, this behavior would have to be added to a variety of emulators (beyond top-of-tree Qemu) that want Grace Hopper support.

Since the device implements 64-bit BAR0, the VFIO PCI variant driver maps the uncached carved-out region to the next available PCI BAR (i.e., comprising regions 2 and 3). The cached device memory aperture is assigned BAR regions 4 and 5. Qemu will then naturally generate a PCI device in the VM with the uncached aperture reported as the BAR2 region and the cacheable one as BAR4. The variant driver provides emulation for these fake BARs' PCI config space offset registers.

The hardware ensures that the system does not crash when the memory is accessed with memory enable turned off: it synthesizes ~0 on reads and drops writes on such accesses. So there is no need to support disabling/enabling the BAR through the PCI_COMMAND config space register.
The memory layout on the host looks like the following:

               devmem (memlength)
|--------------------------------------------------|
|-------------cached------------------------|--NC--|
|                                            |
usemem.memphys                        resmem.memphys

PCI BARs need to be aligned to a power of 2, but the actual memory on the device may not be. A read or write access to the physical address from the last device PFN up to the next power-of-2 aligned physical address results in reading ~0 and dropped writes. Note that the GPU device driver [5] is capable of knowing the exact device memory size through separate means. The device memory size is primarily kept in the system ACPI tables for use by the VFIO PCI variant module.

Note that the usemem memory is added by the VM Nvidia device driver [5] to the VM kernel as memblocks. Hence make the usable memory size memblock (MEMBLK_SIZE) aligned. This is a hardwired ABI value between the GPU FW and the VFIO driver. The VM device driver makes use of the same value in its calculation to determine the USEMEM size.

Currently there is no provision in KVM for an S2 mapping with MemAttr[2:0]=0b101, but there is an ongoing effort to provide one [3]. As previously mentioned, resmem is mapped with pgprot_writecombine(), which sets the Qemu VMA page properties (pgprot) to NORMAL_NC. Using the proposed changes in [3] and [4], KVM marks the region with MemAttr[2:0]=0b101 in S2.

If the device memory properties are not present, the driver registers the vfio-pci-core function pointers. Since there are no ACPI memory properties generated for the VM, the variant driver inside the VM will only use the vfio-pci-core ops and hence try to map the BARs as non-cached. This is not a problem, as the CPUs have FWB enabled, which blocks the VM mapping's ability to override the cacheability set by the host mapping.

This goes along with a Qemu series [6] that provides the necessary implementation of the Grace Hopper Superchip firmware specification so that the guest operating system can see the correct ACPI modeling for the coherent GPU device. Verified with the CUDA workload in the VM.

[1] https://www.nvidia.com/en-in/technologies/multi-instance-gpu/
[2] section D8.5.5 of https://developer.arm.com/documentation/ddi0487/latest/
[3] https://lore.kernel.org/all/20240211174705.31992-1-ankita@nvidia.com/
[4] https://lore.kernel.org/all/20230907181459.18145-2-ankita@nvidia.com/
[5] https://github.com/NVIDIA/open-gpu-kernel-modules
[6] https://lore.kernel.org/all/20231203060245.31593-1-ankita@nvidia.com/

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Zhi Wang <zhi.wang.linux@gmail.com>
Signed-off-by: Aniket Agashe <aniketa@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20240220115055.23546-4-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
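[Editor's note] The size handling described above reduces to two alignments. A hedged sketch; MEMBLK_SIZE and the field names are taken from the text, the rest is assumption:

        /* Usable (cached) size exposed to the VM must be memblock aligned */
        usemem->memlength = round_down(usemem->memlength, MEMBLK_SIZE);

        /* Reported fake BAR size must be a power of 2 per PCI rules */
        bar_size = roundup_pow_of_two(usemem->memlength);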
2024-02-22vfio/pci: rename and export range_intersect_rangeAnkit Agrawal2-46/+68
range_intersect_range() determines the overlap between two ranges. If the ranges overlap, the helper function returns the overlapping offset and size.

The VFIO PCI variant driver emulates the PCI config space BAR offset registers. These offsets may be accessed for read/write with a variety of lengths, including sub-word sizes from sub-word offsets. The driver makes use of this helper function to read/write the targeted part of the emulated register.

Make this a vfio_pci_core function, rename it, and export it as GPL. Also update the references in the virtio driver.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20240220115055.23546-3-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
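[Editor's note] The overlap computation itself is a few comparisons. A hedged sketch of the helper's logic; the in-kernel prototype may differ in exact naming:

static bool vfio_pci_core_range_intersect_range(loff_t buf_start, size_t buf_cnt,
                                                loff_t reg_start, size_t reg_cnt,
                                                loff_t *buf_offset,
                                                size_t *intersect_count,
                                                size_t *register_offset)
{
        if (buf_start <= reg_start &&
            buf_start + buf_cnt > reg_start) {
                /* register starts inside the buffer */
                *buf_offset = reg_start - buf_start;
                *intersect_count = min_t(size_t, reg_cnt,
                                         buf_start + buf_cnt - reg_start);
                *register_offset = 0;
                return true;
        }

        if (buf_start > reg_start &&
            buf_start < reg_start + reg_cnt) {
                /* buffer starts inside the register */
                *buf_offset = 0;
                *intersect_count = min_t(size_t, buf_cnt,
                                         reg_start + reg_cnt - buf_start);
                *register_offset = buf_start - reg_start;
                return true;
        }

        return false;
}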
2024-02-22vfio/pci: rename and export do_io_rw()Ankit Agrawal1-7/+9
do_io_rw() is used to read/write the device MMIO. The Grace Hopper VFIO PCI variant driver requires this functionality to read/write its memory. Rename it as a vfio_pci_core function and export it as GPL.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20240220115055.23546-2-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
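[Editor's note] A hedged sketch of a variant-driver caller after the export; the parameter list approximates the exported vfio_pci_core_do_io_rw() and may not match the actual prototype exactly:

        /*
         * Read/write 'count' bytes of device memory at 'pos' on behalf
         * of userspace, reusing the core MMIO accessor instead of a
         * private copy of do_io_rw().
         */
        done = vfio_pci_core_do_io_rw(vdev, test_mem, io, buf, pos, count,
                                      0, 0, iswrite);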
2024-02-22vfio/mlx5: Let firmware knows upon leaving PRE_COPY back to RUNNINGYishai Hadas3-12/+45
Let the firmware know upon leaving PRE_COPY back to RUNNING, whether due to an error on the target side or a migration cancellation. This lets the firmware clean up the internal resources that were turned on for PRE_COPY. The flow is based on the device specification in this area.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Leon Romanovsky <leon@kernel.org>
Link: https://lore.kernel.org/r/20240205124828.232701-6-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
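[Editor's note] Conceptually this adds one arc to the migration state machine. A hedged sketch; mlx5vf_cmd_notify_pre_copy_exit() is a hypothetical name for the firmware notification:

        if (cur == VFIO_DEVICE_STATE_PRE_COPY &&
            new == VFIO_DEVICE_STATE_RUNNING) {
                /* hypothetical helper: let firmware free PRE_COPY resources */
                mlx5vf_cmd_notify_pre_copy_exit(mvdev);
                return NULL;
        }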
2024-02-22vfio/mlx5: Block incremental query upon migf state errorYishai Hadas1-0/+5
Block incremental query which is state-dependent once the migration file was previously marked with state error. This may prevent redundant calls to firmware upon PRE_COPY which will end-up with a failure and a syndrome printed in dmesg. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Leon Romanovsky <leon@kernel.org> Link: https://lore.kernel.org/r/20240205124828.232701-5-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-02-22vfio/mlx5: Handle the EREMOTEIO error upon the SAVE commandYishai Hadas1-1/+6
The SAVE command uses the async command interface over the PF. Upon a failure in the firmware, -EREMOTEIO is returned. In that case, call mlx5_cmd_out_err() to let it print the command failure details, including the firmware syndrome.

Note: the other commands in the driver use the sync command interface in a way that a firmware syndrome is printed upon an error inside mlx5_core.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Leon Romanovsky <leon@kernel.org>
Link: https://lore.kernel.org/r/20240205124828.232701-4-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
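[Editor's note] The handling reduces to a check on the async completion status. A hedged sketch; the exact call placement and the opcode argument are assumptions:

        if (status == -EREMOTEIO) {
                /* print failure details, including the firmware syndrome */
                mlx5_cmd_out_err(mvdev->mdev, MLX5_CMD_OP_SAVE_VHCA_STATE, 0,
                                 async_data->out);
        }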