summaryrefslogtreecommitdiff
path: root/fs/nfsd/vfs.c
AgeCommit message (Collapse)AuthorFilesLines
2021-07-06nfsd: Reduce contention for the nfsd_file nf_rwsemTrond Myklebust1-2/+16
When flushing out the unstable file writes as part of a COMMIT call, try to perform most of of the data writes and waits outside the semaphore. This means that if the client is sending the COMMIT as part of a memory reclaim operation, then it can continue performing I/O, with contention for the lock occurring only once the data sync is finished. Fixes: 5011af4c698a ("nfsd: Fix stable writes") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Tested-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-05-25nfsd: move some commit_metadata()s outside the inode lockJ. Bruce Fields1-2/+3
The commit may be time-consuming and there's no need to hold the lock for it. More of these are possible, these were just some easy ones. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-05-25nfsd: Prevent truncation of an unlinked inode from blocking access to its ↵Yu Hsiang Huang1-0/+5
directory Truncation of an unlinked inode may take a long time for I/O waiting, and it doesn't have to prevent access to the directory. Thus, let truncation occur outside the directory's mutex, just like do_unlinkat() does. Signed-off-by: Yu Hsiang Huang <nickhuang@synology.com> Signed-off-by: Bing Jing Chang <bingjingc@synology.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-03-22NFSD: Add a tracepoint to record directory entry encodingChuck Lever1-3/+6
Enable watching the progress of directory encoding to capture the timing of any issues with reading or encoding a directory. The new tracepoint captures dirent encoding for all NFS versions. For example, here's what a few NFSv4 directory entries might look like: nfsd-989 [002] 468.596265: nfsd_dirent: fh_hash=0x5d162594 ino=2 name=. nfsd-989 [002] 468.596267: nfsd_dirent: fh_hash=0x5d162594 ino=1 name=.. nfsd-989 [002] 468.596299: nfsd_dirent: fh_hash=0x5d162594 ino=3827 name=zlib.c nfsd-989 [002] 468.596325: nfsd_dirent: fh_hash=0x5d162594 ino=3811 name=xdiff nfsd-989 [002] 468.596351: nfsd_dirent: fh_hash=0x5d162594 ino=3810 name=xdiff-interface.h nfsd-989 [002] 468.596377: nfsd_dirent: fh_hash=0x5d162594 ino=3809 name=xdiff-interface.c Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-02-23Merge tag 'idmapped-mounts-v5.12' of ↵Linus Torvalds1-19/+31
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull idmapped mounts from Christian Brauner: "This introduces idmapped mounts which has been in the making for some time. Simply put, different mounts can expose the same file or directory with different ownership. This initial implementation comes with ports for fat, ext4 and with Christoph's port for xfs with more filesystems being actively worked on by independent people and maintainers. Idmapping mounts handle a wide range of long standing use-cases. Here are just a few: - Idmapped mounts make it possible to easily share files between multiple users or multiple machines especially in complex scenarios. For example, idmapped mounts will be used in the implementation of portable home directories in systemd-homed.service(8) where they allow users to move their home directory to an external storage device and use it on multiple computers where they are assigned different uids and gids. This effectively makes it possible to assign random uids and gids at login time. - It is possible to share files from the host with unprivileged containers without having to change ownership permanently through chown(2). - It is possible to idmap a container's rootfs and without having to mangle every file. For example, Chromebooks use it to share the user's Download folder with their unprivileged containers in their Linux subsystem. - It is possible to share files between containers with non-overlapping idmappings. - Filesystem that lack a proper concept of ownership such as fat can use idmapped mounts to implement discretionary access (DAC) permission checking. - They allow users to efficiently changing ownership on a per-mount basis without having to (recursively) chown(2) all files. In contrast to chown (2) changing ownership of large sets of files is instantenous with idmapped mounts. This is especially useful when ownership of a whole root filesystem of a virtual machine or container is changed. With idmapped mounts a single syscall mount_setattr syscall will be sufficient to change the ownership of all files. - Idmapped mounts always take the current ownership into account as idmappings specify what a given uid or gid is supposed to be mapped to. This contrasts with the chown(2) syscall which cannot by itself take the current ownership of the files it changes into account. It simply changes the ownership to the specified uid and gid. This is especially problematic when recursively chown(2)ing a large set of files which is commong with the aforementioned portable home directory and container and vm scenario. - Idmapped mounts allow to change ownership locally, restricting it to specific mounts, and temporarily as the ownership changes only apply as long as the mount exists. Several userspace projects have either already put up patches and pull-requests for this feature or will do so should you decide to pull this: - systemd: In a wide variety of scenarios but especially right away in their implementation of portable home directories. https://systemd.io/HOME_DIRECTORY/ - container runtimes: containerd, runC, LXD:To share data between host and unprivileged containers, unprivileged and privileged containers, etc. The pull request for idmapped mounts support in containerd, the default Kubernetes runtime is already up for quite a while now: https://github.com/containerd/containerd/pull/4734 - The virtio-fs developers and several users have expressed interest in using this feature with virtual machines once virtio-fs is ported. - ChromeOS: Sharing host-directories with unprivileged containers. I've tightly synced with all those projects and all of those listed here have also expressed their need/desire for this feature on the mailing list. For more info on how people use this there's a bunch of talks about this too. Here's just two recent ones: https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf https://fosdem.org/2021/schedule/event/containers_idmap/ This comes with an extensive xfstests suite covering both ext4 and xfs: https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts It covers truncation, creation, opening, xattrs, vfscaps, setid execution, setgid inheritance and more both with idmapped and non-idmapped mounts. It already helped to discover an unrelated xfs setgid inheritance bug which has since been fixed in mainline. It will be sent for inclusion with the xfstests project should you decide to merge this. In order to support per-mount idmappings vfsmounts are marked with user namespaces. The idmapping of the user namespace will be used to map the ids of vfs objects when they are accessed through that mount. By default all vfsmounts are marked with the initial user namespace. The initial user namespace is used to indicate that a mount is not idmapped. All operations behave as before and this is verified in the testsuite. Based on prior discussions we want to attach the whole user namespace and not just a dedicated idmapping struct. This allows us to reuse all the helpers that already exist for dealing with idmappings instead of introducing a whole new range of helpers. In addition, if we decide in the future that we are confident enough to enable unprivileged users to setup idmapped mounts the permission checking can take into account whether the caller is privileged in the user namespace the mount is currently marked with. The user namespace the mount will be marked with can be specified by passing a file descriptor refering to the user namespace as an argument to the new mount_setattr() syscall together with the new MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern of extensibility. The following conditions must be met in order to create an idmapped mount: - The caller must currently have the CAP_SYS_ADMIN capability in the user namespace the underlying filesystem has been mounted in. - The underlying filesystem must support idmapped mounts. - The mount must not already be idmapped. This also implies that the idmapping of a mount cannot be altered once it has been idmapped. - The mount must be a detached/anonymous mount, i.e. it must have been created by calling open_tree() with the OPEN_TREE_CLONE flag and it must not already have been visible in the filesystem. The last two points guarantee easier semantics for userspace and the kernel and make the implementation significantly simpler. By default vfsmounts are marked with the initial user namespace and no behavioral or performance changes are observed. The manpage with a detailed description can be found here: https://git.kernel.org/brauner/man-pages/c/1d7b902e2875a1ff342e036a9f866a995640aea8 In order to support idmapped mounts, filesystems need to be changed and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The patches to convert individual filesystem are not very large or complicated overall as can be seen from the included fat, ext4, and xfs ports. Patches for other filesystems are actively worked on and will be sent out separately. The xfstestsuite can be used to verify that port has been done correctly. The mount_setattr() syscall is motivated independent of the idmapped mounts patches and it's been around since July 2019. One of the most valuable features of the new mount api is the ability to perform mounts based on file descriptors only. Together with the lookup restrictions available in the openat2() RESOLVE_* flag namespace which we added in v5.6 this is the first time we are close to hardened and race-free (e.g. symlinks) mounting and path resolution. While userspace has started porting to the new mount api to mount proper filesystems and create new bind-mounts it is currently not possible to change mount options of an already existing bind mount in the new mount api since the mount_setattr() syscall is missing. With the addition of the mount_setattr() syscall we remove this last restriction and userspace can now fully port to the new mount api, covering every use-case the old mount api could. We also add the crucial ability to recursively change mount options for a whole mount tree, both removing and adding mount options at the same time. This syscall has been requested multiple times by various people and projects. There is a simple tool available at https://github.com/brauner/mount-idmapped that allows to create idmapped mounts so people can play with this patch series. I'll add support for the regular mount binary should you decide to pull this in the following weeks: Here's an example to a simple idmapped mount of another user's home directory: u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt u1001@f2-vm:/$ ls -al /home/ubuntu/ total 28 drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 . drwxr-xr-x 4 root root 4096 Oct 28 04:00 .. -rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history -rw-r--r-- 1 ubuntu ubuntu 220 Feb 25 2020 .bash_logout -rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25 2020 .bashrc -rw-r--r-- 1 ubuntu ubuntu 807 Feb 25 2020 .profile -rw-r--r-- 1 ubuntu ubuntu 0 Oct 16 16:11 .sudo_as_admin_successful -rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo u1001@f2-vm:/$ ls -al /mnt/ total 28 drwxr-xr-x 2 u1001 u1001 4096 Oct 28 22:07 . drwxr-xr-x 29 root root 4096 Oct 28 22:01 .. -rw------- 1 u1001 u1001 3154 Oct 28 22:12 .bash_history -rw-r--r-- 1 u1001 u1001 220 Feb 25 2020 .bash_logout -rw-r--r-- 1 u1001 u1001 3771 Feb 25 2020 .bashrc -rw-r--r-- 1 u1001 u1001 807 Feb 25 2020 .profile -rw-r--r-- 1 u1001 u1001 0 Oct 16 16:11 .sudo_as_admin_successful -rw------- 1 u1001 u1001 1144 Oct 28 00:43 .viminfo u1001@f2-vm:/$ touch /mnt/my-file u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file u1001@f2-vm:/$ ls -al /mnt/my-file -rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file u1001@f2-vm:/$ ls -al /home/ubuntu/my-file -rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file u1001@f2-vm:/$ getfacl /mnt/my-file getfacl: Removing leading '/' from absolute path names # file: mnt/my-file # owner: u1001 # group: u1001 user::rw- user:u1001:rwx group::rw- mask::rwx other::r-- u1001@f2-vm:/$ getfacl /home/ubuntu/my-file getfacl: Removing leading '/' from absolute path names # file: home/ubuntu/my-file # owner: ubuntu # group: ubuntu user::rw- user:ubuntu:rwx group::rw- mask::rwx other::r--" * tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits) xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl xfs: support idmapped mounts ext4: support idmapped mounts fat: handle idmapped mounts tests: add mount_setattr() selftests fs: introduce MOUNT_ATTR_IDMAP fs: add mount_setattr() fs: add attr_flags_to_mnt_flags helper fs: split out functions to hold writers namespace: only take read lock in do_reconfigure_mnt() mount: make {lock,unlock}_mount_hash() static namespace: take lock_mount_hash() directly when changing flags nfs: do not export idmapped mounts overlayfs: do not mount on top of idmapped mounts ecryptfs: do not mount on top of idmapped mounts ima: handle idmapped mounts apparmor: handle idmapped mounts fs: make helpers idmap mount aware exec: handle idmapped mounts would_dump: handle idmapped mounts ...
2021-01-25nfsd: report per-export statsAmir Goldstein1-2/+2
Collect some nfsd stats per export in addition to the global stats. A new nfsdfs export_stats file is created. It uses the same ops as the exports file to iterate the export entries and we use the file's name to determine the reported info per export. For example: $ cat /proc/fs/nfsd/export_stats # Version 1.1 # Path Client Start-time # Stats /test localhost 92 fh_stale: 0 io_read: 9 io_write: 1 Every export entry reports the start time when stats collection started, so stats collecting scripts can know if stats where reset between samples. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-01-25nfsd: protect concurrent access to nfsd stats countersAmir Goldstein1-2/+2
nfsd stats counters can be updated by concurrent nfsd threads without any protection. Convert some nfsd_stats and nfsd_net struct members to use percpu counters. The longest_chain* members of struct nfsd_net remain unprotected. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-01-24namei: prepare for idmapped mountsChristian Brauner1-8/+11
The various vfs_*() helpers are called by filesystems or by the vfs itself to perform core operations such as create, link, mkdir, mknod, rename, rmdir, tmpfile and unlink. Enable them to handle idmapped mounts. If the inode is accessed through an idmapped mount map it into the mount's user namespace and pass it down. Afterwards the checks and operations are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-15-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24namei: introduce struct renamedataChristian Brauner1-1/+7
In order to handle idmapped mounts we will extend the vfs rename helper to take two new arguments in follow up patches. Since this operations already takes a bunch of arguments add a simple struct renamedata and make the current helper use it before we extend it. Link: https://lore.kernel.org/r/20210121131959.646623-14-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24xattr: handle idmapped mountsTycho Andersen1-6/+8
When interacting with extended attributes the vfs verifies that the caller is privileged over the inode with which the extended attribute is associated. For posix access and posix default extended attributes a uid or gid can be stored on-disk. Let the functions handle posix extended attributes on idmapped mounts. If the inode is accessed through an idmapped mount we need to map it according to the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. This has no effect for e.g. security xattrs since they don't store uids or gids and don't perform permission checks on them like posix acls do. Link: https://lore.kernel.org/r/20210121131959.646623-10-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Tycho Andersen <tycho@tycho.pizza> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24attr: handle idmapped mountsChristian Brauner1-2/+2
When file attributes are changed most filesystems rely on the setattr_prepare(), setattr_copy(), and notify_change() helpers for initialization and permission checking. Let them handle idmapped mounts. If the inode is accessed through an idmapped mount map it into the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Helpers that perform checks on the ia_uid and ia_gid fields in struct iattr assume that ia_uid and ia_gid are intended values and have already been mapped correctly at the userspace-kernelspace boundary as we already do today. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24namei: make permission helpers idmapped mount awareChristian Brauner1-2/+3
The two helpers inode_permission() and generic_permission() are used by the vfs to perform basic permission checking by verifying that the caller is privileged over an inode. In order to handle idmapped mounts we extend the two helpers with an additional user namespace argument. On idmapped mounts the two helpers will make sure to map the inode according to the mount's user namespace and then peform identical permission checks to inode_permission() and generic_permission(). If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-12-09nfsd: Set PF_LOCAL_THROTTLE on local filesystems onlyTrond Myklebust1-2/+11
Don't set PF_LOCAL_THROTTLE on remote filesystems like NFS, since they aren't expected to ever be subject to double buffering. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-12-09nfsd: close cached files prior to a REMOVE or RENAME that would replace targetJeff Layton1-7/+9
It's not uncommon for some workloads to do a bunch of I/O to a file and delete it just afterward. If knfsd has a cached open file however, then the file may still be open when the dentry is unlinked. If the underlying filesystem is nfs, then that could trigger it to do a sillyrename. On a REMOVE or RENAME scan the nfsd_file cache for open files that correspond to the inode, and proactively unhash and put their references. This should prevent any delete-on-last-close activity from occurring, solely due to knfsd's open file cache. This must be done synchronously though so we use the variants that call flush_delayed_fput. There are deadlock possibilities if you call flush_delayed_fput while holding locks, however. In the case of nfsd_rename, we don't even do the lookups of the dentries to be renamed until we've locked for rename. Once we've figured out what the target dentry is for a rename, check to see whether there are cached open files associated with it. If there are, then unwind all of the locking, close them all, and then reattempt the rename. None of this is really necessary for "typical" filesystems though. It's mostly of use for NFS, so declare a new export op flag and use that to determine whether to close the files beforehand. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-09-25NFSD: Correct type annotations in user xattr helpersChuck Lever1-2/+4
Squelch some sparse warnings: /home/cel/src/linux/linux/fs/nfsd/vfs.c:2264:13: warning: incorrect type in assignment (different base types) /home/cel/src/linux/linux/fs/nfsd/vfs.c:2264:13: expected int err /home/cel/src/linux/linux/fs/nfsd/vfs.c:2264:13: got restricted __be32 /home/cel/src/linux/linux/fs/nfsd/vfs.c:2266:24: warning: incorrect type in return expression (different base types) /home/cel/src/linux/linux/fs/nfsd/vfs.c:2266:24: expected restricted __be32 /home/cel/src/linux/linux/fs/nfsd/vfs.c:2266:24: got int err /home/cel/src/linux/linux/fs/nfsd/vfs.c:2288:13: warning: incorrect type in assignment (different base types) /home/cel/src/linux/linux/fs/nfsd/vfs.c:2288:13: expected int err /home/cel/src/linux/linux/fs/nfsd/vfs.c:2288:13: got restricted __be32 /home/cel/src/linux/linux/fs/nfsd/vfs.c:2290:24: warning: incorrect type in return expression (different base types) /home/cel/src/linux/linux/fs/nfsd/vfs.c:2290:24: expected restricted __be32 /home/cel/src/linux/linux/fs/nfsd/vfs.c:2290:24: got int err Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-08-23treewide: Use fallthrough pseudo-keywordGustavo A. R. Silva1-2/+2
Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-07-13nfsd: take xattr bits into account for permission checksFrank van der Linden1-0/+12
Since the NFSv4.2 extended attributes extension defines 3 new access bits for xattr operations, take them in to account when validating what the client is asking for, and when checking permissions. Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-07-13nfsd: define xattr functions to call into their vfs counterpartsFrank van der Linden1-0/+227
This adds the filehandle based functions for the xattr operations that call in to the vfs layer to do the actual work. Signed-off-by: Frank van der Linden <fllinden@amazon.com> [ cel: address checkpatch.pl complaint ] Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-06-17nfsd: apply umask on fs without ACL supportJ. Bruce Fields1-0/+6
The server is failing to apply the umask when creating new objects on filesystems without ACL support. To reproduce this, you need to use NFSv4.2 and a client and server recent enough to support umask, and you need to export a filesystem that lacks ACL support (for example, ext4 with the "noacl" mount option). Filesystems with ACL support are expected to take care of the umask themselves (usually by calling posix_acl_create). For filesystems without ACL support, this is up to the caller of vfs_create(), vfs_mknod(), or vfs_mkdir(). Reported-by: Elliott Mitchell <ehem+debian@m5p.com> Reported-by: Salvatore Bonaccorso <carnil@debian.org> Tested-by: Salvatore Bonaccorso <carnil@debian.org> Fixes: 47057abde515 ("nfsd: add support for the umask attribute") Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-06-02mm/writeback: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLENeilBrown1-4/+5
PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a daemon needs to write to one bdi (the final bdi) in order to free up writes queued to another bdi (the client bdi). The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty pages, so that it can still dirty pages after other processses have been throttled. The purpose of this is to avoid deadlock that happen when the PF_LESS_THROTTLE process must write for any dirty pages to be freed, but it is being thottled and cannot write. This approach was designed when all threads were blocked equally, independently on which device they were writing to, or how fast it was. Since that time the writeback algorithm has changed substantially with different threads getting different allowances based on non-trivial heuristics. This means the simple "add 25%" heuristic is no longer reliable. The important issue is not that the daemon needs a *larger* dirty page allowance, but that it needs a *private* dirty page allowance, so that dirty pages for the "client" bdi that it is helping to clear (the bdi for an NFS filesystem or loop block device etc) do not affect the throttling of the daemon writing to the "final" bdi. This patch changes the heuristic so that the task is not throttled when the bdi it is writing to has a dirty page count below below (or equal to) the free-run threshold for that bdi. This ensures it will always be able to have some pages in flight, and so will not deadlock. In a steady-state, it is expected that PF_LOCAL_THROTTLE tasks might still be throttled by global threshold, but that is acceptable as it is only the deadlock state that is interesting for this flag. This approach of "only throttle when target bdi is busy" is consistent with the other use of PF_LESS_THROTTLE in current_may_throttle(), were it causes attention to be focussed only on the target bdi. So this patch - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE, - removes the 25% bonus that that flag gives, and - If PF_LOCAL_THROTTLE is set, don't delay at all unless the global and the local free-run thresholds are exceeded. Note that previously realtime threads were treated the same as PF_LESS_THROTTLE threads. This patch does *not* change the behvaiour for real-time threads, so it is now different from the behaviour of nfsd and loop tasks. I don't know what is wanted for realtime. [akpm@linux-foundation.org: coding style fixes] Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Chuck Lever <chuck.lever@oracle.com> [nfsd] Cc: Christoph Hellwig <hch@lst.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Link: http://lkml.kernel.org/r/87ftbf7gs3.fsf@notabene.neil.brown.name Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-22nfsd: Ensure sampling of the write verifier is atomic with the writeTrond Myklebust1-3/+9
When doing an unstable write, we need to ensure that we sample the write verifier before releasing the lock, and allowing a commit to the same file to proceed. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-01-22nfsd: Ensure sampling of the commit verifier is atomic with the commitTrond Myklebust1-2/+6
When we have a successful commit, ensure we sample the commit verifier before releasing the lock. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-01-22nfsd: Ensure exclusion between CLONE and WRITE errorsTrond Myklebust1-7/+18
Ensure that we can distinguish between synchronous CLONE and WRITE errors, and that we can assign them correctly. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-01-22nfsd: Pass the nfsd_file as arguments to nfsd4_clone_file_range()Trond Myklebust1-2/+4
Needed in order to fix exclusion w.r.t. writes. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-01-22nfsd: Update the boot verifier on stable writes too.Trond Myklebust1-1/+4
We don't know if the error returned by the fsync() call is exclusive to the data written by the stable write, so play it safe. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-01-22nfsd: Fix stable writesTrond Myklebust1-2/+16
Strictly speaking, a stable write error needs to reflect the write + the commit of that write (and only that write). To ensure that we don't pick up the write errors from other writebacks, add a rw_semaphore to provide exclusion. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-01-22nfsd: Allow nfsd_vfs_write() to take the nfsd_file as an argumentTrond Myklebust1-2/+3
Needed in order to fix stable writes. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-01-03nfsd: use true,false for bool variable in vfs.czhengbin1-3/+3
Fixes coccicheck warning: fs/nfsd/vfs.c:1389:5-13: WARNING: Assignment of 0/1 to bool variable fs/nfsd/vfs.c:1398:5-13: WARNING: Assignment of 0/1 to bool variable fs/nfsd/vfs.c:1415:2-10: WARNING: Assignment of 0/1 to bool variable Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: zhengbin <zhengbin13@huawei.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-12-19nfsd: pass a 64-bit guardtime to nfsd_setattr()Arnd Bergmann1-2/+2
Guardtime handling in nfs3 differs between 32-bit and 64-bit architectures, and uses the deprecated time_t type. Change it to using time64_t, which behaves the same way on 64-bit and 32-bit architectures, treating the number as an unsigned 32-bit entity with a range of year 1970 to 2106 consistently, and avoiding the y2038 overflow. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-12-19nfsd: Clone should commit src file metadata tooTrond Myklebust1-5/+14
vfs_clone_file_range() can modify the metadata on the source file too, so we need to commit that to stable storage as well. Reported-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Acked-by: Dave Chinner <david@fromorbit.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-12-17nfsd: Return the correct number of bytes written to the fileTrond Myklebust1-0/+1
We must allow for the fact that iov_iter_write() could have returned a short write (e.g. if there was an ENOSPC issue). Fixes: d890be159a71 "nfsd: Add I/O trace points in the NFSv4 write path" Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-11-30nfsd: check for EBUSY from vfs_rmdir/vfs_unink.NeilBrown1-1/+11
vfs_rmdir and vfs_unlink can return -EBUSY if the target is a mountpoint. This currently gets passed to nfserrno() by nfsd_unlink(), and that results in a WARNing, which is not user-friendly. Possibly the best NFSv4 error is NFS4ERR_FILE_OPEN, because there is a sense in which the object is currently in use by some other task. The Linux NFSv4 client will map this back to EBUSY, which is an added benefit. For NFSv3, the best we can do is probably NFS3ERR_ACCES, which isn't true, but is not less true than the other options. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-11-30nfsd: Ensure CLONE persists data and metadata changes to the target fileTrond Myklebust1-1/+7
The NFSv4.2 CLONE operation has implicit persistence requirements on the target file, since there is no protocol requirement that the client issue a separate operation to persist data. For that reason, we should call vfs_fsync_range() on the destination file after a successful call to vfs_clone_file_range(). Fixes: ffa0160a1039 ("nfsd: implement the NFSv4.2 CLONE operation") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: stable@vger.kernel.org # v4.5+ Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-09-23nfsd: fix nfs read eof detectionTrond Myklebust1-11/+26
Currently, the knfsd server assumes that a short read indicates an end of file. That assumption is incorrect. The short read means that either we've hit the end of file, or we've hit a read error. In the case of a read error, the client may want to retry (as per the implementation recommendations in RFC1813 and RFC7530), but currently it is being told that it hit an eof. Move the code to detect eof from version specific code into the generic nfsd read. Report eof only in the two following cases: 1) read() returns a zero length short read with no error. 2) the offset+length of the read is >= the file size. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-09-10nfsd: Reset the boot verifier on all write I/O errorsTrond Myklebust1-4/+15
If multiple clients are writing to the same file, then due to the fact we share a single file descriptor between all NFSv3 clients writing to the file, we have a situation where clients can miss the fact that their file data was not persisted. While this should be rare, it could cause silent data loss in situations where multiple clients are using NLM locking or O_DIRECT to write to the same file. Unfortunately, the stateless nature of NFSv3 and the fact that we can only identify clients by their IP address means that we cannot trivially cache errors; we would not know when it is safe to release them from the cache. So the solution is to declare a reboot. We understand that this should be a rare occurrence, since disks are usually stable. The most frequent occurrence is likely to be ENOSPC, at which point all writes to the given filesystem are likely to fail anyway. So the expectation is that clients will be forced to retry their writes until they hit the fatal error. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-08-19nfsd: close cached files prior to a REMOVE or RENAME that would replace targetJeff Layton1-9/+53
It's not uncommon for some workloads to do a bunch of I/O to a file and delete it just afterward. If knfsd has a cached open file however, then the file may still be open when the dentry is unlinked. If the underlying filesystem is nfs, then that could trigger it to do a sillyrename. On a REMOVE or RENAME scan the nfsd_file cache for open files that correspond to the inode, and proactively unhash and put their references. This should prevent any delete-on-last-close activity from occurring, solely due to knfsd's open file cache. This must be done synchronously though so we use the variants that call flush_delayed_fput. There are deadlock possibilities if you call flush_delayed_fput while holding locks, however. In the case of nfsd_rename, we don't even do the lookups of the dentries to be renamed until we've locked for rename. Once we've figured out what the target dentry is for a rename, check to see whether there are cached open files associated with it. If there are, then unwind all of the locking, close them all, and then reattempt the rename. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-08-19nfsd: rip out the raparms cacheJeff Layton1-149/+0
The raparms cache was set up in order to ensure that we carry readahead information forward from one RPC call to the next. In other words, it was set up because each RPC call was forced to open a struct file, then close it, causing the loss of readahead information that is normally cached in that struct file, and used to keep the page cache filled when a user calls read() multiple times on the same file descriptor. Now that we cache the struct file, and reuse it for all the I/O calls to a given file by a given user, we no longer have to keep a separate readahead cache. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-08-19nfsd: hook nfsd_commit up to the nfsd_file cacheJeff Layton1-7/+7
Use cached filps if possible instead of opening a new one every time. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-08-19nfsd: hook up nfsd_read to the nfsd_file cacheJeff Layton1-7/+4
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-08-19nfsd: hook up nfsd_write to the new nfsd_file cacheJeff Layton1-5/+7
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-08-19nfsd: add a new struct file caching facility to nfsdJeff Layton1-22/+43
Currently, NFSv2/3 reads and writes have to open a file, do the read or write and then close it again for each RPC. This is highly inefficient, especially when the underlying filesystem has a relatively slow open routine. This patch adds a new open file cache to knfsd. Rather than doing an open for each RPC, the read/write handlers can call into this cache to see if there is one already there for the correct filehandle and NFS_MAY_READ/WRITE flags. If there isn't an entry, then we create a new one and attempt to perform the open. If there is, then we wait until the entry is fully instantiated and return it if it is at the end of the wait. If it's not, then we attempt to take over construction. Since the main goal is to speed up NFSv2/3 I/O, we don't want to close these files on last put of these objects. We need to keep them around for a little while since we never know when the next READ/WRITE will come in. Cache entries have a hardcoded 1s timeout, and we have a recurring workqueue job that walks the cache and purges any entries that have expired. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Richard Sharpe <richard.sharpe@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03nfsd: Spelling s/EACCESS/EACCES/Geert Uytterhoeven1-1/+1
The correct spelling is EACCES: include/uapi/asm-generic/errno-base.h:#define EACCES 13 /* Permission denied */ Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-04-24nfsd: fh_drop_write in nfsd_unlinkJ. Bruce Fields1-3/+5
fh_want_write() can now be called twice, but I'm also fixing up the callers not to do that. Other cases include setattr and create. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-02-06nfsd: Fix error return values for nfsd4_clone_file_range()Trond Myklebust1-2/+4
If the parameter 'count' is non-zero, nfsd4_clone_file_range() will currently clobber all errors returned by vfs_clone_file_range() and replace them with EINVAL. Fixes: 42ec3d4c0218 ("vfs: make remap_file_range functions take and...") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: stable@vger.kernel.org # v4.20+ Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-12-04nfsd: Return EPERM, not EACCES, in some SETATTR caseszhengbin1-2/+15
As the man(2) page for utime/utimes states, EPERM is returned when the second parameter of utime or utimes is not NULL, the caller's effective UID does not match the owner of the file, and the caller is not privileged. However, in a NFS directory mounted from knfsd, it will return EACCES (from nfsd_setattr-> fh_verify->nfsd_permission). This patch fixes that. Signed-off-by: zhengbin <zhengbin13@huawei.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-11-02Merge tag 'xfs-4.20-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-2/+6
Pull vfs dedup fixes from Dave Chinner: "This reworks the vfs data cloning infrastructure. We discovered many issues with these interfaces late in the 4.19 cycle - the worst of them (data corruption, setuid stripping) were fixed for XFS in 4.19-rc8, but a larger rework of the infrastructure fixing all the problems was needed. That rework is the contents of this pull request. Rework the vfs_clone_file_range and vfs_dedupe_file_range infrastructure to use a common .remap_file_range method and supply generic bounds and sanity checking functions that are shared with the data write path. The current VFS infrastructure has problems with rlimit, LFS file sizes, file time stamps, maximum filesystem file sizes, stripping setuid bits, etc and so they are addressed in these commits. We also introduce the ability for the ->remap_file_range methods to return short clones so that clones for vfs_copy_file_range() don't get rejected if the entire range can't be cloned. It also allows filesystems to sliently skip deduplication of partial EOF blocks if they are not capable of doing so without requiring errors to be thrown to userspace. Existing filesystems are converted to user the new remap_file_range method, and both XFS and ocfs2 are modified to make use of the new generic checking infrastructure" * tag 'xfs-4.20-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (28 commits) xfs: remove [cm]time update from reflink calls xfs: remove xfs_reflink_remap_range xfs: remove redundant remap partial EOF block checks xfs: support returning partial reflink results xfs: clean up xfs_reflink_remap_blocks call site xfs: fix pagecache truncation prior to reflink ocfs2: remove ocfs2_reflink_remap_range ocfs2: support partial clone range and dedupe range ocfs2: fix pagecache truncation prior to reflink ocfs2: truncate page cache for clone destination file before remapping vfs: clean up generic_remap_file_range_prep return value vfs: hide file range comparison function vfs: enable remap callers that can handle short operations vfs: plumb remap flags through the vfs dedupe functions vfs: plumb remap flags through the vfs clone functions vfs: make remap_file_range functions take and return bytes completed vfs: remap helper should update destination inode metadata vfs: pass remap flags to generic_remap_checks vfs: pass remap flags to generic_remap_file_range_prep vfs: combine the clone and dedupe into a single remap_file_range ...
2018-11-01Merge branch 'work.afs' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull AFS updates from Al Viro: "AFS series, with some iov_iter bits included" * 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits) missing bits of "iov_iter: Separate type from direction and use accessor functions" afs: Probe multiple fileservers simultaneously afs: Fix callback handling afs: Eliminate the address pointer from the address list cursor afs: Allow dumping of server cursor on operation failure afs: Implement YFS support in the fs client afs: Expand data structure fields to support YFS afs: Get the target vnode in afs_rmdir() and get a callback on it afs: Calc callback expiry in op reply delivery afs: Fix FS.FetchStatus delivery from updating wrong vnode afs: Implement the YFS cache manager service afs: Remove callback details from afs_callback_break struct afs: Commit the status on a new file/dir/symlink afs: Increase to 64-bit volume ID and 96-bit vnode ID for YFS afs: Don't invoke the server to read data beyond EOF afs: Add a couple of tracepoints to log I/O errors afs: Handle EIO from delivery function afs: Fix TTL on VL server and address lists afs: Implement VL server rotation afs: Improve FS server rotation error handling ...
2018-10-30Merge tag 'nfsd-4.20' of git://linux-nfs.org/~bfields/linuxLinus Torvalds1-3/+2
Pull nfsd updates from Bruce Fields: "Olga added support for the NFSv4.2 asynchronous copy protocol. We already supported COPY, by copying a limited amount of data and then returning a short result, letting the client resend. The asynchronous protocol should offer better performance at the expense of some complexity. The other highlight is Trond's work to convert the duplicate reply cache to a red-black tree, and to move it and some other server caches to RCU. (Previously these have meant taking global spinlocks on every RPC) Otherwise, some RDMA work and miscellaneous bugfixes" * tag 'nfsd-4.20' of git://linux-nfs.org/~bfields/linux: (30 commits) lockd: fix access beyond unterminated strings in prints nfsd: Fix an Oops in free_session() nfsd: correctly decrement odstate refcount in error path svcrdma: Increase the default connection credit limit svcrdma: Remove try_module_get from backchannel svcrdma: Remove ->release_rqst call in bc reply handler svcrdma: Reduce max_send_sges nfsd: fix fall-through annotations knfsd: Improve lookup performance in the duplicate reply cache using an rbtree knfsd: Further simplify the cache lookup knfsd: Simplify NFS duplicate replay cache knfsd: Remove dead code from nfsd_cache_lookup SUNRPC: Simplify TCP receive code SUNRPC: Replace the cache_detail->hash_lock with a regular spinlock SUNRPC: Remove non-RCU protected lookup NFS: Fix up a typo in nfs_dns_ent_put NFS: Lockless DNS lookups knfsd: Lockless lookup of NFSv4 identities. SUNRPC: Lockless server RPCSEC_GSS context lookup knfsd: Allow lockless lookups of the exports ...
2018-10-30vfs: plumb remap flags through the vfs clone functionsDarrick J. Wong1-1/+1
Plumb a remap_flags argument through the {do,vfs}_clone_file_range functions so that clone can take advantage of it. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30vfs: make remap_file_range functions take and return bytes completedDarrick J. Wong1-2/+6
Change the remap_file_range functions to take a number of bytes to operate upon and return the number of bytes they operated on. This is a requirement for allowing fs implementations to return short clone/dedupe results to the user, which will enable us to obey resource limits in a graceful manner. A subsequent patch will enable copy_file_range to signal to the ->clone_file_range implementation that it can handle a short length, which will be returned in the function's return value. For now the short return is not implemented anywhere so the behavior won't change -- either copy_file_range manages to clone the entire range or it tries an alternative. Neither clone ioctl can take advantage of this, alas. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Dave Chinner <david@fromorbit.com>