summaryrefslogtreecommitdiff
path: root/fs/xfs
AgeCommit message (Collapse)AuthorFilesLines
2024-05-13Merge tag 'vfs-6.10.misc' of ↵Linus Torvalds1-4/+6
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual fses. Features: - Free up FMODE_* bits. I've freed up bits 6, 7, 8, and 24. That means we now have six free FMODE_* bits in total (but bit #6 already got used for FMODE_WRITE_RESTRICTED) - Add FOP_HUGE_PAGES flag (follow-up to FMODE_* cleanup) - Add fd_raw cleanup class so we can make use of automatic cleanup provided by CLASS(fd_raw, f)(fd) for O_PATH fds as well - Optimize seq_puts() - Simplify __seq_puts() - Add new anon_inode_getfile_fmode() api to allow specifying f_mode instead of open-coding it in multiple places - Annotate struct file_handle with __counted_by() and use struct_size() - Warn in get_file() whether f_count resurrection from zero is attempted (epoll/drm discussion) - Folio-sophize aio - Export the subvolume id in statx() for both btrfs and bcachefs - Relax linkat(AT_EMPTY_PATH) requirements - Add F_DUPFD_QUERY fcntl() allowing to compare two file descriptors for dup*() equality replacing kcmp() Cleanups: - Compile out swapfile inode checks when swap isn't enabled - Use (1 << n) notation for FMODE_* bitshifts for clarity - Remove redundant variable assignment in fs/direct-io - Cleanup uses of strncpy in orangefs - Speed up and cleanup writeback - Move fsparam_string_empty() helper into header since it's currently open-coded in multiple places - Add kernel-doc comments to proc_create_net_data_write() - Don't needlessly read dentry->d_flags twice Fixes: - Fix out-of-range warning in nilfs2 - Fix ecryptfs overflow due to wrong encryption packet size calculation - Fix overly long line in xfs file_operations (follow-up to FMODE_* cleanup) - Don't raise FOP_BUFFER_{R,W}ASYNC for directories in xfs (follow-up to FMODE_* cleanup) - Don't call xfs_file_open from xfs_dir_open (follow-up to FMODE_* cleanup) - Fix stable offset api to prevent endless loops - Fix afs file server rotations - Prevent xattr node from overflowing the eraseblock in jffs2 - Move fdinfo PTRACE_MODE_READ procfs check into the .permission() operation instead of .open() operation since this caused userspace regressions" * tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits) afs: Fix fileserver rotation getting stuck selftests: add F_DUPDFD_QUERY selftests fcntl: add F_DUPFD_QUERY fcntl() file: add fd_raw cleanup class fs: WARN when f_count resurrection is attempted seq_file: Simplify __seq_puts() seq_file: Optimize seq_puts() proc: Move fdinfo PTRACE_MODE_READ check into the inode .permission operation fs: Create anon_inode_getfile_fmode() xfs: don't call xfs_file_open from xfs_dir_open xfs: drop fop_flags for directories xfs: fix overly long line in the file_operations shmem: Fix shmem_rename2() libfs: Add simple_offset_rename() API libfs: Fix simple_offset_rename_exchange() jffs2: prevent xattr node from overflowing the eraseblock vfs, swap: compile out IS_SWAPFILE() on swapless configs vfs: relax linkat() AT_EMPTY_PATH - aka flink() - requirements fs/direct-io: remove redundant assignment to variable retval fs/dcache: Re-use value stored to dentry->d_flags instead of re-reading ...
2024-04-24xfs: don't call xfs_file_open from xfs_dir_openChristoph Hellwig1-1/+3
Directories do not support direct I/O and thus no non-blocking direct I/O either. Open code the shutdown check and call to generic_file_open instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240423124608.537794-4-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-24xfs: drop fop_flags for directoriesChristoph Hellwig1-2/+0
Directories have non of the capabilities, so drop the flags. Note that the current state is harmless as no one actually checks for the flags either. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240423124608.537794-3-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-24xfs: fix overly long line in the file_operationsChristoph Hellwig1-4/+4
Re-wrap the newly added fop_flags fields to not go over 80 characters. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240423124608.537794-2-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-07fs: claw back a few FMODE_* bitsChristian Brauner1-3/+5
There's a bunch of flags that are purely based on what the file operations support while also never being conditionally set or unset. IOW, they're not subject to change for individual files. Imho, such flags don't need to live in f_mode they might as well live in the fops structs itself. And the fops struct already has that lonely mmap_supported_flags member. We might as well turn that into a generic fop_flags member and move a few flags from FMODE_* space into FOP_* space. That gets us four FMODE_* bits back and the ability for new static flags that are about file ops to not have to live in FMODE_* space but in their own FOP_* space. It's not the most beautiful thing ever but it gets the job done. Yes, there'll be an additional pointer chase but hopefully that won't matter for these flags. I suspect there's a few more we can move into there and that we can also redirect a bunch of new flag suggestions that follow this pattern into the fop_flags field instead of f_mode. Link: https://lore.kernel.org/r/20240328-gewendet-spargel-aa60a030ef74@brauner Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-06Merge tag 'xfs-6.9-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-2/+13
Pull xfs fix from Chandan Babu: - Allow creating new links to special files which were not associated with a project quota * tag 'xfs-6.9-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: allow cross-linking special files without project quota
2024-04-05Merge tag 'vfs-6.9-rc3.fixes' of ↵Linus Torvalds2-4/+4
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "This contains a few small fixes. This comes with some delay because I wanted to wait on people running their reproducers and the Easter Holidays meant that those replies came in a little later than usual: - Fix handling of preventing writes to mounted block devices. Since last kernel we allow to prevent writing to mounted block devices provided CONFIG_BLK_DEV_WRITE_MOUNTED isn't set and the block device is opened with restricted writes. When we switched to opening block devices as files we altered the mechanism by which we recognize when a block device has been opened with write restrictions. The detection logic assumed that only read-write mounted filesystems would apply write restrictions to their block devices from other openers. That of course is not true since it also makes sense to apply write restrictions for filesystems that are read-only. Fix the detection logic using an FMODE_* bit. We still have a few left since we freed up a couple a while ago. I also picked up a patch to free up four additional FMODE_* bits scheduled for the next merge window. - Fix counting the number of writers to a block device. This just changes the logic to be consistent. - Fix a bug in aio causing a NULL pointer derefernce after we implemented batched processing in aio. - Finally, add the changes we discussed that allows to yield block devices early even though file closing itself is deferred. This also allows us to remove two holder operations to get and release the holder to align lifetime of file and holder of the block device" * tag 'vfs-6.9-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: aio: Fix null ptr deref in aio_complete() wakeup fs,block: yield devices early block: count BLK_OPEN_RESTRICT_WRITES openers block: handle BLK_OPEN_RESTRICT_WRITES correctly
2024-04-01xfs: allow cross-linking special files without project quotaAndrey Albershteyn1-2/+13
There's an issue that if special files is created before quota project is enabled, then it's not possible to link this file. This works fine for normal files. This happens because xfs_quota skips special files (no ioctls to set necessary flags). The check for having the same project ID for source and destination then fails as source file doesn't have any ID. mkfs.xfs -f /dev/sda mount -o prjquota /dev/sda /mnt/test mkdir /mnt/test/foo mkfifo /mnt/test/foo/fifo1 xfs_quota -xc "project -sp /mnt/test/foo 9" /mnt/test > Setting up project 9 (path /mnt/test/foo)... > xfs_quota: skipping special file /mnt/test/foo/fifo1 > Processed 1 (/etc/projects and cmdline) paths for project 9 with recursion depth infinite (-1). ln /mnt/test/foo/fifo1 /mnt/test/foo/fifo1_link > ln: failed to create hard link '/mnt/test/testdir/fifo1_link' => '/mnt/test/testdir/fifo1': Invalid cross-device link mkfifo /mnt/test/foo/fifo2 ln /mnt/test/foo/fifo2 /mnt/test/foo/fifo2_link Fix this by allowing linking of special files to the project quota if special files doesn't have any ID set (ID = 0). Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-03-27fs,block: yield devices earlyChristian Brauner2-4/+4
Currently a device is only really released once the umount returns to userspace due to how file closing works. That ultimately could cause an old umount assumption to be violated that concurrent umount and mount don't fail. So an exclusively held device with a temporary holder should be yielded before the filesystem is gone. Add a helper that allows callers to do that. This also allows us to remove the two holder ops that Linus wasn't excited about. Link: https://lore.kernel.org/r/20240326-vfs-bdev-end_holder-v1-1-20af85202918@kernel.org Fixes: f3a608827d1f ("bdev: open block device as files") # mainline only Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-25xfs: don't use current->journal_infoDave Chinner4-21/+7
syzbot reported an ext4 panic during a page fault where found a journal handle when it didn't expect to find one. The structure it tripped over had a value of 'TRAN' in the first entry in the structure, and that indicates it tripped over a struct xfs_trans instead of a jbd2 handle. The reason for this is that the page fault was taken during a copy-out to a user buffer from an xfs bulkstat operation. XFS uses an "empty" transaction context for bulkstat to do automated metadata buffer cleanup, and so the transaction context is valid across the copyout of the bulkstat info into the user buffer. We are using empty transaction contexts like this in XFS to reduce the risk of failing to release objects we reference during the operation, especially during error handling. Hence we really need to ensure that we can take page faults from these contexts without leaving landmines for the code processing the page fault to trip over. However, this same behaviour could happen from any other filesystem that triggers a page fault or any other exception that is handled on-stack from within a task context that has current->journal_info set. Having a page fault from some other filesystem bounce into XFS where we have to run a transaction isn't a bug at all, but the usage of current->journal_info means that this could result corruption of the outer task's journal_info structure. The problem is purely that we now have two different contexts that now think they own current->journal_info. IOWs, no filesystem can allow page faults or on-stack exceptions while current->journal_info is set by the filesystem because the exception processing might use current->journal_info itself. If we end up with nested XFS transactions whilst holding an empty transaction, then it isn't an issue as the outer transaction does not hold a log reservation. If we ignore the current->journal_info usage, then the only problem that might occur is a deadlock if the exception tries to take the same locks the upper context holds. That, however, is not a problem that setting current->journal_info would solve, so it's largely an irrelevant concern here. IOWs, we really only use current->journal_info for a warning check in xfs_vm_writepages() to ensure we aren't doing writeback from a transaction context. Writeback might need to do allocation, so it can need to run transactions itself. Hence it's a debug check to warn us that we've done something silly, and largely it is not all that useful. So let's just remove all the use of current->journal_info in XFS and get rid of all the potential issues from nested contexts where current->journal_info might get misused by another filesystem context. Reported-by: syzbot+cdee56dbcdf0096ef605@syzkaller.appspotmail.com Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Mark Tinguely <mark.tinguely@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-03-25xfs: allow sunit mount option to repair bad primary sb stripe valuesDave Chinner2-11/+34
If a filesystem has a busted stripe alignment configuration on disk (e.g. because broken RAID firmware told mkfs that swidth was smaller than sunit), then the filesystem will refuse to mount due to the stripe validation failing. This failure is triggering during distro upgrades from old kernels lacking this check to newer kernels with this check, and currently the only way to fix it is with offline xfs_db surgery. This runtime validity checking occurs when we read the superblock for the first time and causes the mount to fail immediately. This prevents the rewrite of stripe unit/width via mount options that occurs later in the mount process. Hence there is no way to recover this situation without resorting to offline xfs_db rewrite of the values. However, we parse the mount options long before we read the superblock, and we know if the mount has been asked to re-write the stripe alignment configuration when we are reading the superblock and verifying it for the first time. Hence we can conditionally ignore stripe verification failures if the mount options specified will correct the issue. We validate that the new stripe unit/width are valid before we overwrite the superblock values, so we can ignore the invalid config at verification and fail the mount later if the new values are not valid. This, at least, gives users the chance of correcting the issue after a kernel upgrade without having to resort to xfs-db hacks. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-03-22Merge tag 'xfs-6.9-merge-9' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds3-9/+22
Pull xfs fixes from Chandan Babu: - Fix invalid pointer dereference by initializing xmbuf before tracepoint function is invoked - Use memalloc_nofs_save() when inserting into quota radix tree * tag 'xfs-6.9-merge-9' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: quota radix tree allocations need to be NOFS on insert xfs: fix dev_t usage in xmbuf tracepoints
2024-03-15xfs: quota radix tree allocations need to be NOFS on insertDave Chinner1-5/+13
In converting the XFS code from GFP_NOFS to scoped contexts, we converted the quota radix tree to GFP_KERNEL. Unfortunately, it was not clearly documented that this set was because there is a dependency on the quotainfo->qi_tree_lock being taken in memory reclaim to remove dquots from the radix tree. In hindsight this is obvious, but the radix tree allocations on insert are not immediately obvious, and we avoid this for the inode cache radix trees by using preloading and hence completely avoiding the radix tree node allocation under tree lock constraints. Hence there are a few solutions here. The first is to reinstate GFP_NOFS for the radix tree and add a comment explaining why GFP_NOFS is used. The second is to use memalloc_nofs_save() on the radix tree insert context, which makes it obvious that the radix tree insert runs under GFP_NOFS constraints. The third option is to simply replace the radix tree and it's lock with an xarray which can do memory allocation safely in an insert context. The first is OK, but not really the direction we want to head. The second is my preferred short term solution. The third - converting XFS radix trees to xarray - is the longer term solution. Hence to fix the regression here, we take option 2 as it moves us in the direction we want to head with memory allocation and GFP_NOFS removal. Reported-by: syzbot+8fdff861a781522bda4d@syzkaller.appspotmail.com Reported-by: syzbot+d247769793ec169e4bf9@syzkaller.appspotmail.com Fixes: 94a69db2367e ("xfs: use __GFP_NOLOCKDEP instead of GFP_NOFS") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-03-15xfs: fix dev_t usage in xmbuf tracepointsDarrick J. Wong2-4/+9
Fix some inconsistencies in the xmbuf tracepoints -- they should be reporting the major/minor of the filesystem that they're associated with, so that we have some clue on whose behalf the xmbuf was created. Fix the xmbuf_free tracepoint to report the same. Don't call the trace function until the xmbuf is fully initialized. Fixes: 5076a6040ca1 ("xfs: support in-memory buffer cache target") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-03-13Merge tag 'xfs-6.9-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds176-3515/+13139
Pull xfs updates from Chandan Babu: - Online repair updates: - More ondisk structures being repaired: - Inode's mode field by trying to obtain file type value from the a directory entry - Quota counters - Link counts of inodes - FS summary counters - Support for in-memory btrees has been added to support repair of rmap btrees - Misc changes: - Report corruption of metadata to the health tracking subsystem - Enable indirect health reporting when resources are scarce - Reduce memory usage while repairing refcount btree - Extend "Bmap update" intent item to support atomic extent swapping on the realtime device - Extend "Bmap update" intent item to support extended attribute fork and unwritten extents - Code cleanups: - Bmap log intent - Btree block pointer checking - Btree readahead - Buffer target - Symbolic link code - Remove mrlock wrapper around the rwsem - Convert all the GFP_NOFS flag usages to use the scoped memalloc_nofs_save() API instead of direct calls with the GFP_NOFS - Refactor and simplify xfile abstraction. Lower level APIs in shmem.c are required to be exported in order to achieve this - Skip checking alignment constraints for inode chunk allocations when block size is larger than inode chunk size - Do not submit delwri buffers collected during log recovery when an error has been encountered - Fix SEEK_HOLE/DATA for file regions which have active COW extents - Fix lock order inversion when executing error handling path during shrinking a filesystem - Remove duplicate ifdefs * tag 'xfs-6.9-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (183 commits) xfs: shrink failure needs to hold AGI buffer mm/shmem.c: Use new form of *@param in kernel-doc kernel-doc: Add unary operator * to $type_param_ref xfs: use kvfree() in xlog_cil_free_logvec() xfs: xfs_btree_bload_prep_block() should use __GFP_NOFAIL xfs: fix scrub stats file permissions xfs: fix log recovery erroring out on refcount recovery failure xfs: move symlink target write function to libxfs xfs: move remote symlink target read function to libxfs xfs: move xfs_symlink_remote.c declarations to xfs_symlink_remote.h xfs: xfs_bmap_finish_one should map unwritten extents properly xfs: support deferred bmap updates on the attr fork xfs: support recovering bmap intent items targetting realtime extents xfs: add a realtime flag to the bmap update log redo items xfs: add a xattr_entry helper xfs: fix xfs_bunmapi to allow unmapping of partial rt extents xfs: move xfs_bmap_defer_add to xfs_bmap_item.c xfs: reuse xfs_bmap_update_cancel_item xfs: add a bi_entry helper xfs: remove xfs_trans_set_bmap_flags ...
2024-03-12mm, slab: remove last vestiges of SLAB_MEM_SPREADLinus Torvalds1-4/+3
Yes, yes, I know the slab people were planning on going slow and letting every subsystem fight this thing on their own. But let's just rip off the band-aid and get it over and done with. I don't want to see a number of unnecessary pull requests just to get rid of a flag that no longer has any meaning. This was mainly done with a couple of 'sed' scripts and then some manual cleanup of the end result. Link: https://lore.kernel.org/all/CAHk-=wji0u+OOtmAOD-5JV3SXcRJF___k_+8XNKmak0yd5vW1Q@mail.gmail.com/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-03-11Merge tag 'vfs-6.9.uuid' of ↵Linus Torvalds1-1/+3
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs uuid updates from Christian Brauner: "This adds two new ioctl()s for getting the filesystem uuid and retrieving the sysfs path based on the path of a mounted filesystem. Getting the filesystem uuid has been implemented in filesystem specific code for a while it's now lifted as a generic ioctl" * tag 'vfs-6.9.uuid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: xfs: add support for FS_IOC_GETFSSYSFSPATH fs: add FS_IOC_GETFSSYSFSPATH fat: Hook up sb->s_uuid fs: FS_IOC_GETUUID ovl: convert to super_set_uuid() fs: super_set_uuid()
2024-03-11Merge tag 'vfs-6.9.super' of ↵Linus Torvalds3-29/+29
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull block handle updates from Christian Brauner: "Last cycle we changed opening of block devices, and opening a block device would return a bdev_handle. This allowed us to implement support for restricting and forbidding writes to mounted block devices. It was accompanied by converting and adding helpers to operate on bdev_handles instead of plain block devices. That was already a good step forward but ultimately it isn't necessary to have special purpose helpers for opening block devices internally that return a bdev_handle. Fundamentally, opening a block device internally should just be equivalent to opening files. So now all internal opens of block devices return files just as a userspace open would. Instead of introducing a separate indirection into bdev_open_by_*() via struct bdev_handle bdev_file_open_by_*() is made to just return a struct file. Opening and closing a block device just becomes equivalent to opening and closing a file. This all works well because internally we already have a pseudo fs for block devices and so opening block devices is simple. There's a few places where we needed to be careful such as during boot when the kernel is supposed to mount the rootfs directly without init doing it. Here we need to take care to ensure that we flush out any asynchronous file close. That's what we already do for opening, unpacking, and closing the initramfs. So nothing new here. The equivalence of opening and closing block devices to regular files is a win in and of itself. But it also has various other advantages. We can remove struct bdev_handle completely. Various low-level helpers are now private to the block layer. Other helpers were simply removable completely. A follow-up series that is already reviewed build on this and makes it possible to remove bdev->bd_inode and allows various clean ups of the buffer head code as well. All places where we stashed a bdev_handle now just stash a file and use simple accessors to get to the actual block device which was already the case for bdev_handle" * tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits) block: remove bdev_handle completely block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access bdev: remove bdev pointer from struct bdev_handle bdev: make struct bdev_handle private to the block layer bdev: make bdev_{release, open_by_dev}() private to block layer bdev: remove bdev_open_by_path() reiserfs: port block device access to file ocfs2: port block device access to file nfs: port block device access to files jfs: port block device access to file f2fs: port block device access to files ext4: port block device access to file erofs: port device access to file btrfs: port device access to file bcachefs: port block device access to file target: port block device access to file s390: port block device access to file nvme: port block device access to file block2mtd: port device access to files bcache: port block device access to files ...
2024-03-11Merge tag 'vfs-6.9.iomap' of ↵Linus Torvalds1-4/+5
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull iomap updates from Christian Brauner: - Restore read-write hints in struct bio through the bi_write_hint member for the sake of UFS devices in mobile applications. This can result in up to 40% lower write amplification in UFS devices. The patch series that builds on this will be coming in via the SCSI maintainers (Bart) - Overhaul the iomap writeback code. Afterwards ->map_blocks() is able to map multiple blocks at once as long as they're in the same folio. This reduces CPU usage for buffered write workloads on e.g., xfs on systems with lots of cores (Christoph) - Record processed bytes in iomap_iter() trace event (Kassey) - Extend iomap_writepage_map() trace event after Christoph's ->map_block() changes to map mutliple blocks at once (Zhang) * tag 'vfs-6.9.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits) iomap: Add processed for iomap_iter iomap: add pos and dirty_len into trace_iomap_writepage_map block, fs: Restore the per-bio/request data lifetime fields fs: Propagate write hints to the struct block_device inode fs: Move enum rw_hint into a new header file fs: Split fcntl_rw_hint() fs: Verify write lifetime constants at compile time fs: Fix rw_hint validation iomap: pass the length of the dirty region to ->map_blocks iomap: map multiple blocks at a time iomap: submit ioends immediately iomap: factor out a iomap_writepage_map_block helper iomap: only call mapping_set_error once for each failed bio iomap: don't chain bios iomap: move the iomap_sector sector calculation out of iomap_add_to_ioend iomap: clean up the iomap_alloc_ioend calling convention iomap: move all remaining per-folio logic into iomap_writepage_map iomap: factor out a iomap_writepage_handle_eof helper iomap: move the PF_MEMALLOC check to iomap_writepages iomap: move the io_folios field out of struct iomap_ioend ...
2024-03-07xfs: shrink failure needs to hold AGI bufferDave Chinner1-1/+10
Chandan reported a AGI/AGF lock order hang on xfs/168 during recent testing. The cause of the problem was the task running xfs_growfs to shrink the filesystem. A failure occurred trying to remove the free space from the btrees that the shrink would make disappear, and that meant it ran the error handling for a partial failure. This error path involves restoring the per-ag block reservations, and that requires calculating the amount of space needed to be reserved for the free inode btree. The growfs operation hung here: [18679.536829] down+0x71/0xa0 [18679.537657] xfs_buf_lock+0xa4/0x290 [xfs] [18679.538731] xfs_buf_find_lock+0xf7/0x4d0 [xfs] [18679.539920] xfs_buf_lookup.constprop.0+0x289/0x500 [xfs] [18679.542628] xfs_buf_get_map+0x2b3/0xe40 [xfs] [18679.547076] xfs_buf_read_map+0xbb/0x900 [xfs] [18679.562616] xfs_trans_read_buf_map+0x449/0xb10 [xfs] [18679.569778] xfs_read_agi+0x1cd/0x500 [xfs] [18679.573126] xfs_ialloc_read_agi+0xc2/0x5b0 [xfs] [18679.578708] xfs_finobt_calc_reserves+0xe7/0x4d0 [xfs] [18679.582480] xfs_ag_resv_init+0x2c5/0x490 [xfs] [18679.586023] xfs_ag_shrink_space+0x736/0xd30 [xfs] [18679.590730] xfs_growfs_data_private.isra.0+0x55e/0x990 [xfs] [18679.599764] xfs_growfs_data+0x2f1/0x410 [xfs] [18679.602212] xfs_file_ioctl+0xd1e/0x1370 [xfs] trying to get the AGI lock. The AGI lock was held by a fstress task trying to do an inode allocation, and it was waiting on the AGF lock to allocate a new inode chunk on disk. Hence deadlock. The fix for this is for the growfs code to hold the AGI over the transaction roll it does in the error path. It already holds the AGF locked across this, and that is what causes the lock order inversion in the xfs_ag_resv_init() call. Reported-by: Chandan Babu R <chandanbabu@kernel.org> Fixes: 46141dc891f7 ("xfs: introduce xfs_ag_shrink_space()") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-28xfs: use kvfree() in xlog_cil_free_logvec()Dave Chinner1-2/+2
The xfs_log_vec items are allocated by xlog_kvmalloc(), and so need to be freed with kvfree. This was missed when coverting from the kmem_free() API. Fixes: 49292576136f ("xfs: convert kmem_free() for kvmalloc users to kvfree()") Reported-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-28xfs: xfs_btree_bload_prep_block() should use __GFP_NOFAILDave Chinner1-1/+1
This was missed in the conversion from KM* flags. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: 10634530f7ba ("xfs: convert kmem_zalloc() to kzalloc()") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-27xfs: drop experimental warning for FSDAXShiyang Ruan1-1/+0
FSDAX and reflink can work together now, let's drop this warning. Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-26xfs: fix scrub stats file permissionsDarrick J. Wong1-2/+2
When the kernel is in lockdown mode, debugfs will only show files that are world-readable and cannot be written, mmaped, or used with ioctl. That more or less describes the scrub stats file, except that the permissions are wrong -- they should be 0444, not 0644. You can't write the stats file, so the 0200 makes no sense. Meanwhile, the clear_stats file is only writable, but it got mode 0400 instead of 0200, which would make more sense. Fix both files so that they make sense. Fixes: d7a74cad8f451 ("xfs: track usage statistics of online fsck") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-25xfs: port block device access to filesChristian Brauner3-29/+29
Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-7-adbd023e19cc@kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-25bdev: open block device as filesChristian Brauner1-1/+1
Add two new helpers to allow opening block devices as files. This is not the final infrastructure. This still opens the block device before opening a struct a file. Until we have removed all references to struct bdev_handle we can't switch the order: * Introduce blk_to_file_flags() to translate from block specific to flags usable to pen a new file. * Introduce bdev_file_open_by_{dev,path}(). * Introduce temporary sb_bdev_handle() helper to retrieve a struct bdev_handle from a block device file and update places that directly reference struct bdev_handle to rely on it. * Don't count block device openes against the number of open files. A bdev_file_open_by_{dev,path}() file is never installed into any file descriptor table. One idea that came to mind was to use kernel_tmpfile_open() which would require us to pass a path and it would then call do_dentry_open() going through the regular fops->open::blkdev_open() path. But then we're back to the problem of routing block specific flags such as BLK_OPEN_RESTRICT_WRITES through the open path and would have to waste FMODE_* flags every time we add a new one. With this we can avoid using a flag bit and we have more leeway in how we open block devices from bdev_open_by_{dev,path}(). Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-1-adbd023e19cc@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-24xfs: fix log recovery erroring out on refcount recovery failureDarrick J. Wong1-0/+1
Per the comment in the error case of xfs_reflink_recover_cow, zero out any error (after shutting down the log) so that we actually kill any new intent items that might have gotten logged by later recovery steps. Discovered by xfs/434, which few people actually seem to run. Fixes: 2c1e31ed5c88 ("xfs: place intent recovery under NOFS allocation context") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-22xfs: move symlink target write function to libxfsDarrick J. Wong3-64/+84
Move xfs_symlink_write_target to xfs_symlink_remote.c so that kernel and mkfs can share the same function. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: move remote symlink target read function to libxfsDarrick J. Wong5-77/+80
Move xfs_readlink_bmap_ilocked to xfs_symlink_remote.c so that the swapext code can use it to convert a remote format symlink back to shortform format after a metadata repair. While we're at it, fix a broken printf prefix. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: move xfs_symlink_remote.c declarations to xfs_symlink_remote.hDarrick J. Wong8-14/+28
Move declarations for libxfs symlink functions into a separate header file like we do for most everything else. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: xfs_bmap_finish_one should map unwritten extents properlyDarrick J. Wong1-0/+2
The deferred bmap work state and the log item can transmit unwritten state, so the XFS_BMAP_MAP handler must map in extents with that unwritten state. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: support deferred bmap updates on the attr forkDarrick J. Wong4-38/+29
The deferred bmap update log item has always supported the attr fork, so plumb this in so that higher layers can access this. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: support recovering bmap intent items targetting realtime extentsDarrick J. Wong1-0/+9
Now that we have reflink on the realtime device, bmap intent items have to support remapping extents on the realtime volume. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: add a realtime flag to the bmap update log redo itemsDarrick J. Wong3-6/+29
Extend the bmap update (BUI) log items with a new realtime flag that indicates that the updates apply against a realtime file's data fork. We'll wire up the actual code later. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: fix xfs_bunmapi to allow unmapping of partial rt extentsDarrick J. Wong1-2/+2
When XFS_BMAPI_REMAP is passed to bunmapi, that means that we want to remove part of a block mapping without touching the allocator. For realtime files with rtextsize > 1, that also means that we should skip all the code that changes a partial remove request into an unwritten extent conversion. IOWs, bunmapi in this mode should handle removing the mapping from the rt file and nothing else. Note that XFS_BMAPI_REMAP callers are required to decrement the reference count and/or free the space manually. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: add a xattr_entry helperDarrick J. Wong1-4/+7
Add a helper to translate from the item list head to the attr_intent item structure and use it so shorten assignments and avoid the need for extra local variables. Inspired-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: move xfs_bmap_defer_add to xfs_bmap_item.cDarrick J. Wong4-8/+20
Move the code that adds the incore xfs_bmap_item deferred work data to a transaction live with the BUI log item code. This means that the file mapping code no longer has to know about the inner workings of the BUI log items. As a consequence, we can hide the _get_group helper. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: reuse xfs_bmap_update_cancel_itemDarrick J. Wong1-13/+12
Reuse xfs_bmap_update_cancel_item to put the AG/RTG and free the item in a few places that currently open code the logic. Inspired-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: add a bi_entry helperDarrick J. Wong1-10/+9
Add a helper to translate from the item list head to the bmap_intent structure and use it so shorten assignments and avoid the need for extra local variables. Inspired-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: remove xfs_trans_set_bmap_flagsDarrick J. Wong1-25/+13
Remove this single-use helper. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: clean up bmap log intent item tracepoint callsitesDarrick J. Wong4-47/+33
Pass the incore bmap structure to the tracepoints instead of open-coding the argument passing. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: split tracepoint classes for deferred itemsDarrick J. Wong1-85/+166
We're about to start adding support for deferred log intent items for realtime extents, so split these four types into separate classes so that we can customize them as the transition happens. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: port refcount repair to the new refcount bag structureDarrick J. Wong4-107/+81
Port the refcount record generating code to use the new refcount bag data structure. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: create refcount bag structure for btree repairsDarrick J. Wong5-0/+401
Create a bag structure for refcount information that uses the refcount bag btree defined in the previous patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: define an in-memory btree for storing refcount bag info during repairsDarrick J. Wong5-1/+390
Create a new in-memory btree type so that we can store refcount bag info in a much more memory-efficient and performant format. Recall that the refcount recordset regenerator computes the new recordset from browsing the rmap records. Let's say that the rmap records are: {agbno: 10, length: 40, ...} {agbno: 11, length: 3, ...} {agbno: 12, length: 20, ...} {agbno: 15, length: 1, ...} It is convenient to have a data structure that could quickly tell us the refcount for an arbitrary agbno without wasting memory. An array or a list could do that pretty easily. List suck because of the pointer overhead. xfarrays are a lot more compact, but we want to minimize sparse holes in the xfarray to constrain memory usage. Maintaining any kind of record order isn't needed for correctness, so I created the "rcbag", which is shorthand for an unordered list of (excerpted) reverse mappings. So we add the first rmap to the rcbag, and it looks like: 0: {agbno: 10, length: 40} The refcount for agbno 10 is 1. Then we move on to block 11, so we add the second rmap: 0: {agbno: 10, length: 40} 1: {agbno: 11, length: 3} The refcount for agbno 11 is 2. We move on to block 12, so we add the third: 0: {agbno: 10, length: 40} 1: {agbno: 11, length: 3} 2: {agbno: 12, length: 20} The refcount for agbno 12 and 13 is 3. We move on to block 14, and remove the second rmap: 0: {agbno: 10, length: 40} 1: NULL 2: {agbno: 12, length: 20} The refcount for agbno 14 is 2. We move on to block 15, and add the last rmap. But we don't care where it is and we don't want to expand the array so we put it in slot 1: 0: {agbno: 10, length: 40} 1: {agbno: 15, length: 1} 2: {agbno: 12, length: 20} The refcount for block 15 is 3. Notice how order doesn't matter in this list? That's why repair uses an unordered list, or "bag". The data structure is not a set because it does not guarantee uniqueness. That said, adding and removing specific items is now an O(n) operation because we have no idea where that item might be in the list. Overall, the runtime is O(n^2) which is bad. I realized that I could easily refactor the btree code and reimplement the refcount bag with an xfbtree. Adding and removing is now O(log2 n), so the runtime is at least O(n log2 n), which is much faster. In the end, the rcbag becomes a sorted list, but that's merely a detail of the implementation. The repair code doesn't care. (Note: That horrible xfs_db bmap_inflate command can be used to exercise this sort of rcbag insanity by cranking up refcounts quickly.) Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: hook live rmap operations during a repair operationDarrick J. Wong12-39/+392
Hook the regular rmap code when an rmapbt repair operation is running so that we can unlock the AGF buffer to scan the filesystem and keep the in-memory btree up to date during the scan. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: create a shadow rmap btree during rmap repairDarrick J. Wong9-84/+377
Create an in-memory btree of rmap records instead of an array. This enables us to do live record collection instead of freezing the fs. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: repair the rmapbtDarrick J. Wong18-19/+1608
Rebuild the reverse mapping btree from all primary metadata. This first patch establishes the bare mechanics of finding records and putting together a new ondisk tree; more complex pieces are needed to make it work properly. Link: Documentation/filesystems/xfs-online-fsck-design.rst Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: create agblock bitmap helper to count the number of set regionsDarrick J. Wong3-0/+21
In the next patch, the rmap btree repair code will need to estimate the size of the new ondisk rmapbt. The size is a function of the number of records that will be written to disk, and the size of the recordset is the number of observations made while scanning the filesystem plus the number of OWN_AG records that will be injected into the rmap btree. OWN_AG rmap records track the free space btrees, the AGFL, and the new rmap btree itself. The repair tool uses a bitmap to record the space used for all four structures, which is why we need a function to count the number of set regions. A reviewer requested that this be pulled into a separate patch with its own justification, so here it is. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22xfs: create a helper to decide if a file mapping targets the rt volumeDarrick J. Wong4-4/+14
Create a helper so that we can stop open-coding this decision everywhere. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>