summaryrefslogtreecommitdiff
path: root/fs/fuse
AgeCommit message (Collapse)AuthorFilesLines
2019-09-27Merge tag 'virtio-fs-5.4' of ↵Linus Torvalds5-0/+1220
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse virtio-fs support from Miklos Szeredi: "Virtio-fs allows exporting directory trees on the host and mounting them in guest(s). This isn't actually a new filesystem, but a glue layer between the fuse filesystem and a virtio based back-end. It's similar in functionality to the existing virtio-9p solution, but significantly faster in benchmarks and has better POSIX compliance. Further permformance improvements can be achieved by sharing the page cache between host and guest, allowing for faster I/O and reduced memory use. Kata Containers have been including the out-of-tree virtio-fs (with the shared page cache patches as well) since version 1.7 as an experimental feature. They have been active in development and plan to switch from virtio-9p to virtio-fs as their default solution. There has been interest from other sources as well. The userspace infrastructure is slated to be merged into qemu once the kernel part hits mainline. This was developed by Vivek Goyal, Dave Gilbert and Stefan Hajnoczi" * tag 'virtio-fs-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: virtio-fs: add virtiofs filesystem virtio-fs: add Documentation/filesystems/virtiofs.rst fuse: reserve values for mapping protocol
2019-09-24fuse: Make fuse_args_to_req staticYueHaibing1-1/+1
Fix sparse warning: fs/fuse/dev.c:468:6: warning: symbol 'fuse_args_to_req' was not declared. Should it be static? Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Fixes: 68583165f962 ("fuse: add pages to fuse_args") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-24fuse: fix memleak in cuse_channel_openzhengbin1-0/+1
If cuse_send_init fails, need to fuse_conn_put cc->fc. cuse_channel_open->fuse_conn_init->refcount_set(&fc->count, 1) ->fuse_dev_alloc->fuse_conn_get ->fuse_dev_free->fuse_conn_put Fixes: cc080e9e9be1 ("fuse: introduce per-instance fuse_dev structure") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: zhengbin <zhengbin13@huawei.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-24fuse: fix beyond-end-of-page access in fuse_parse_cache()Tejun Heo1-1/+3
With DEBUG_PAGEALLOC on, the following triggers. BUG: unable to handle page fault for address: ffff88859367c000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 3001067 P4D 3001067 PUD 406d3a8067 PMD 406d30c067 PTE 800ffffa6c983060 Oops: 0000 [#1] SMP DEBUG_PAGEALLOC CPU: 38 PID: 3110657 Comm: python2.7 RIP: 0010:fuse_readdir+0x88f/0xe7a [fuse] Code: 49 8b 4d 08 49 39 4e 60 0f 84 44 04 00 00 48 8b 43 08 43 8d 1c 3c 4d 01 7e 68 49 89 dc 48 03 5c 24 38 49 89 46 60 8b 44 24 30 <8b> 4b 10 44 29 e0 48 89 ca 48 83 c1 1f 48 83 e1 f8 83 f8 17 49 89 RSP: 0018:ffffc90035edbde0 EFLAGS: 00010286 RAX: 0000000000001000 RBX: ffff88859367bff0 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff88859367bfed RDI: 0000000000920907 RBP: ffffc90035edbe90 R08: 000000000000014b R09: 0000000000000004 R10: ffff88859367b000 R11: 0000000000000000 R12: 0000000000000ff0 R13: ffffc90035edbee0 R14: ffff889fb8546180 R15: 0000000000000020 FS: 00007f80b5f4a740(0000) GS:ffff889fffa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff88859367c000 CR3: 0000001c170c2001 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: iterate_dir+0x122/0x180 __x64_sys_getdents+0xa6/0x140 do_syscall_64+0x42/0x100 entry_SYSCALL_64_after_hwframe+0x44/0xa9 It's in fuse_parse_cache(). %rbx (ffff88859367bff0) is fuse_dirent pointer - addr + offset. FUSE_DIRENT_SIZE() is trying to dereference namelen off of it but that derefs into the next page which is disabled by pagealloc debug causing a PF. This is caused by dirent->namelen being accessed before ensuring that there's enough bytes in the page for the dirent. Fix it by pushing down reclen calculation. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 5d7bc7e8680c ("fuse: allow using readdir cache") Cc: stable@vger.kernel.org # v4.20+ Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-24fuse: unexport fuse_put_requestArnd Bergmann1-1/+0
This function has been made static, which now causes a compile-time warning: WARNING: "fuse_put_request" [vmlinux] is a static EXPORT_SYMBOL_GPL Remove the unneeded export. Fixes: 66abc3599c3c ("fuse: unexport request ops") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-24fuse: kmemcg account fs dataKhazhismel Kumykov3-4/+6
account per-file, dentry, and inode data blockdev/superblock and temporary per-request data was left alone, as this usually isn't accounted Reviewed-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-24fuse: on 64-bit store time in d_fsdata directlyKhazhismel Kumykov1-6/+30
Implements the optimization noted in commit f75fdf22b0a8 ("fuse: don't use ->d_time"), as the additional memory can be significant. (In particular, on SLAB configurations this 8-byte alloc becomes 32 bytes). Per-dentry, this can consume significant memory. Reviewed-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-24fuse: fix missing unlock_page in fuse_writepage()Vasily Averin1-0/+1
unlock_page() was missing in case of an already in-flight write against the same page. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Fixes: ff17be086477 ("fuse: writepage: skip already in flight") Cc: <stable@vger.kernel.org> # v3.13 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-18virtio-fs: add virtiofs filesystemStefan Hajnoczi5-0/+1220
Add a basic file system module for virtio-fs. This does not yet contain shared data support between host and guest or metadata coherency speedups. However it is already significantly faster than virtio-9p. Design Overview =============== With the goal of designing something with better performance and local file system semantics, a bunch of ideas were proposed. - Use fuse protocol (instead of 9p) for communication between guest and host. Guest kernel will be fuse client and a fuse server will run on host to serve the requests. - For data access inside guest, mmap portion of file in QEMU address space and guest accesses this memory using dax. That way guest page cache is bypassed and there is only one copy of data (on host). This will also enable mmap(MAP_SHARED) between guests. - For metadata coherency, there is a shared memory region which contains version number associated with metadata and any guest changing metadata updates version number and other guests refresh metadata on next access. This is yet to be implemented. How virtio-fs differs from existing approaches ============================================== The unique idea behind virtio-fs is to take advantage of the co-location of the virtual machine and hypervisor to avoid communication (vmexits). DAX allows file contents to be accessed without communication with the hypervisor. The shared memory region for metadata avoids communication in the common case where metadata is unchanged. By replacing expensive communication with cheaper shared memory accesses, we expect to achieve better performance than approaches based on network file system protocols. In addition, this also makes it easier to achieve local file system semantics (coherency). These techniques are not applicable to network file system protocols since the communications channel is bypassed by taking advantage of shared memory on a local machine. This is why we decided to build virtio-fs rather than focus on 9P or NFS. Caching Modes ============= Like virtio-9p, different caching modes are supported which determine the coherency level as well. The “cache=FOO” and “writeback” options control the level of coherence between the guest and host filesystems. - cache=none metadata, data and pathname lookup are not cached in guest. They are always fetched from host and any changes are immediately pushed to host. - cache=always metadata, data and pathname lookup are cached in guest and never expire. - cache=auto metadata and pathname lookup cache expires after a configured amount of time (default is 1 second). Data is cached while the file is open (close to open consistency). - writeback/no_writeback These options control the writeback strategy. If writeback is disabled, then normal writes will immediately be synchronized with the host fs. If writeback is enabled, then writes may be cached in the guest until the file is closed or an fsync(2) performed. This option has no effect on mmap-ed writes or writes going through the DAX mechanism. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12fuse: allow skipping control interface and forced unmountVivek Goyal2-1/+14
virtio-fs does not support aborting requests which are being processed. That is requests which have been sent to fuse daemon on host. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12fuse: dissociate DESTROY from fuseblkMiklos Szeredi2-4/+17
Allow virtio-fs to also send DESTROY request. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12fuse: delete dentry if timeout is zeroMiklos Szeredi2-3/+28
Don't hold onto dentry in lru list if need to re-lookup it anyway at next access. Only do this if explicitly enabled, otherwise it could result in performance regression. More advanced version of this patch would periodically flush out dentries from the lru which have gone stale. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12fuse: separate fuse device allocation and installation in fuse_connVivek Goyal4-7/+26
As of now fuse_dev_alloc() both allocates a fuse device and installs it in fuse_conn list. fuse_dev_alloc() can fail if fuse_device allocation fails. virtio-fs needs to initialize multiple fuse devices (one per virtio queue). It initializes one fuse device as part of call to fuse_fill_super_common() and rest of the devices are allocated and installed after that. But, we can't afford to fail after calling fuse_fill_super_common() as we don't have a way to undo all the actions done by fuse_fill_super_common(). So to avoid failures after the call to fuse_fill_super_common(), pre-allocate all fuse devices early and install them into fuse connection later. This patch provides two separate helpers for fuse device allocation and fuse device installation in fuse_conn. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12fuse: add fuse_iqueue_ops callbacksStefan Hajnoczi4-22/+81
The /dev/fuse device uses fiq->waitq and fasync to signal that requests are available. These mechanisms do not apply to virtio-fs. This patch introduces callbacks so alternative behavior can be used. Note that queue_interrupt() changes along these lines: spin_lock(&fiq->waitq.lock); wake_up_locked(&fiq->waitq); + kill_fasync(&fiq->fasync, SIGIO, POLL_IN); spin_unlock(&fiq->waitq.lock); - kill_fasync(&fiq->fasync, SIGIO, POLL_IN); Since queue_request() and queue_forget() also call kill_fasync() inside the spinlock this should be safe. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12fuse: extract fuse_fill_super_common()Stefan Hajnoczi2-59/+80
fuse_fill_super() includes code to process the fd= option and link the struct fuse_dev to the fd's struct file. In virtio-fs there is no file descriptor because /dev/fuse is not used. This patch extracts fuse_fill_super_common() so that both classic fuse and virtio-fs can share the code to initialize a mount. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12fuse: export fuse_dequeue_forget() functionVivek Goyal2-6/+11
File systems like virtio-fs need to do not have to play directly with forget list data structures. There is a helper function use that instead. Rename dequeue_forget() to fuse_dequeue_forget() and export it so that stacked filesystems can use it. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12fuse: export fuse_get_unique()Stefan Hajnoczi2-1/+7
virtio-fs will need unique IDs for FORGET requests from outside fs/fuse/dev.c. Make the symbol visible. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12fuse: export fuse_send_init_request()Vivek Goyal2-1/+3
This will be used by virtio-fs to send init request to fuse server after initialization of virt queues. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12fuse: export fuse_len_args()Stefan Hajnoczi2-4/+10
virtio-fs will need to query the length of fuse_arg lists. Make the symbol visible. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12fuse: export fuse_end_request()Stefan Hajnoczi2-9/+15
virtio-fs will need to complete requests from outside fs/fuse/dev.c. Make the symbol visible. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12fuse: fix request limitMiklos Szeredi1-2/+5
The size of struct fuse_req was reduced from 392B to 144B on a non-debug config, thus the sanitize_global_limit() helper was setting a larger default limit. This doesn't really reflect reduction in the memory used by requests, since the fields removed from fuse_req were added to fuse_args derived structs; e.g. sizeof(struct fuse_writepages_args) is 248B, thus resulting in slightly more memory being used for writepage requests overalll (due to using 256B slabs). Make the calculatation ignore the size of fuse_req and use the old 392B value. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: stop copying pages to fuse_reqMiklos Szeredi2-21/+6
The page array pointers are also duplicated across fuse_args_pages and fuse_req. Get rid of the fuse_req ones. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: stop copying args to fuse_reqMiklos Szeredi2-106/+34
No need to duplicate the argument arrays in fuse_req, so just dereference req->args instead of copying to the fuse_req internal ones. This allows further cleanup of the fuse_req structure. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: clean up fuse_reqMiklos Szeredi1-45/+0
Get rid of request specific fields in fuse_req that are not used anymore. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: simplify request allocationMiklos Szeredi3-59/+13
Page arrays are not allocated together with the request anymore. Get rid of the dead code Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: unexport request opsMiklos Szeredi2-81/+9
All requests are now sent with one of the fuse_simple_... helpers. Get rid of the old api from the fuse internal header. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: convert retrieve to simple apiMiklos Szeredi1-30/+62
Rename fuse_request_send_notify_reply() to fuse_simple_notify_reply() and convert to passing fuse_args instead of fuse_req. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: convert release to simple apiMiklos Szeredi2-41/+42
Since we cannot reserve the request structure up-front, make sure that the request allocation doesn't fail using __GFP_NOFAIL. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10cuse: convert init to simple apiMiklos Szeredi1-45/+48
This is a straightforward conversion. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: convert init to simple apiMiklos Szeredi1-28/+37
Bypass the fc->initialized check by setting the force flag. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: convert writepages to simple apiMiklos Szeredi3-185/+206
Derive fuse_writepage_args from fuse_io_args. Sending the request is tricky since it was done with fi->lock held, hence we must either use atomic allocation or release the lock. Both are possible so try atomic first and if it fails, release the lock and do the regular allocation with GFP_NOFS and __GFP_NOFAIL. Both flags are necessary for correct operation. Move the page realloc function from dev.c to file.c and convert to using fuse_writepage_args. The last caller of fuse_write_fill() is gone, so get rid of it. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: convert readdir to simple apiMiklos Szeredi3-66/+40
The old fuse_read_fill() helper can be deleted, now that the last user is gone. The fuse_io_args struct is moved to fuse_i.h so it can be shared between readdir/read code. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: convert readpages to simple apiMiklos Szeredi1-67/+72
Need to extend fuse_io_args with 'attr_ver' and 'ff' members, that take the functionality of the same named members in fuse_req. fuse_short_read() can now take struct fuse_args_pages. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: convert direct_io to simple apiMiklos Szeredi1-95/+124
Change of semantics in fuse_async_req_send/fuse_send_(read|write): these can now return error, in which case the 'end' callback isn't called, so the fuse_io_args object needs to be freed. Added verification that the return value is sane (less than or equal to the requested read/write size). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: add simple background helperMiklos Szeredi2-0/+51
Create a helper named fuse_simple_background() that is similar to fuse_simple_request(). Unlike the latter, it returns immediately and calls the supplied 'end' callback when the reply is received. The supplied 'args' pointer is stored in 'fuse_req' which allows the callback to interpret the output arguments decoded from the reply. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: convert sync write to simple apiMiklos Szeredi1-44/+86
Extract a fuse_write_flags() helper that converts ki_flags relevant write to open flags. The other parts of fuse_send_write() aren't used in the fuse_perform_write() case. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: covert readpage to simple apiMiklos Szeredi1-32/+48
Derive fuse_io_args from struct fuse_args_pages. This will be used for both synchronous and asynchronous read/write requests. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: fuse_short_read(): don't take fuse_req as argumentMiklos Szeredi1-8/+9
This will allow the use of this function when converting to the simple api (which doesn't use fuse_req). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: convert ioctl to simple apiMiklos Szeredi2-56/+45
fuse_simple_request() is converted to return length of last (instead of single) out arg, since FUSE_IOCTL_OUT has two out args, the second of which is variable length. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: move page allocMiklos Szeredi3-15/+16
fuse_req_pages_alloc() is moved to file.c, since its internal use by the device code will eventually be removed. Rename to fuse_pages_alloc() to signify that it's not only usable for fuse_req page array. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: convert readlink to simple apiMiklos Szeredi1-27/+25
Also turn BUG_ON into gracefully recovered WARN_ON. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: add pages to fuse_argsMiklos Szeredi2-10/+43
Derive fuse_args_pages from fuse_args. This is used to handle requests which use pages for input or output. The related flags are added to fuse_args. New FR_ALLOC_PAGES flags is added to indicate whether the page arrays in fuse_req need to be freed by fuse_put_request() or not. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: convert destroy to simple apiMiklos Szeredi3-22/+14
We can use the "force" flag to make sure the DESTROY request is always sent to userspace. So no need to keep it allocated during the lifetime of the filesystem. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: add nocreds to fuse_argsMiklos Szeredi2-29/+16
In some cases it makes no sense to set pid/uid/gid fields in the request header. Allow fuse_simple_background() to omit these. This is only required in the "force" case, so for now just WARN if set otherwise. Fold fuse_get_req_nofail_nopages() into its only caller. Comment is obsolete anyway. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: convert fuse_force_forget() to simple apiMiklos Szeredi3-24/+21
Move this function to the readdir.c where its only caller resides. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: add noreply to fuse_argsMiklos Szeredi2-1/+4
This will be used by fuse_force_forget(). We can expand fuse_request_send() into fuse_simple_request(). The FR_WAITING bit has already been set, no need to check. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: convert flush to simple apiMiklos Szeredi3-20/+19
Add 'force' to fuse_args and use fuse_get_req_nofail_nopages() to allocate the request in that case. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: simplify 'nofail' requestMiklos Szeredi4-69/+6
Instead of complex games with a reserved request, just use __GFP_NOFAIL. Both calers (flush, readdir) guarantee that connection was already initialized, so no need to wait for fc->initialized. Also remove unneeded clearing of FR_BACKGROUND flag. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: rearrange and resize fuse_args fieldsMiklos Szeredi2-6/+6
This makes the structure better packed. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10fuse: flatten 'struct fuse_args'Miklos Szeredi6-223/+216
...to make future expansion simpler. The hiearachical structure is a historical thing that does not serve any practical purpose. The generated code is excatly the same before and after the patch. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>