summaryrefslogtreecommitdiff
path: root/io_uring/io_uring.c
AgeCommit message (Collapse)AuthorFilesLines
2024-09-24io_uring: fix casts to io_req_flags_tMin-Hua Chen1-1/+1
Apply __force cast to restricted io_req_flags_t type to fix the following sparse warning: io_uring/io_uring.c:2026:23: sparse: warning: cast to restricted io_req_flags_t No functional changes intended. Signed-off-by: Min-Hua Chen <minhuadotchen@gmail.com> Link: https://lore.kernel.org/r/20240922104132.157055-1-minhuadotchen@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-24io_uring: fix memory leak when cache init failGuixin Liu1-1/+4
Exit the percpu ref when cache init fails to free the data memory with in struct percpu_ref. Fixes: 206aefde4f88 ("io_uring: reduce/pack size of io_ring_ctx") Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20240923100512.64638-1-kanie@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-24Merge tag 'for-6.12/io_uring-20240922' of git://git.kernel.dk/linuxLinus Torvalds1-3/+18
Pull more io_uring updates from Jens Axboe: "Mostly just a set of fixes in here, or little changes that didn't get included in the initial pull request. This contains: - Move the SQPOLL napi polling outside the submission lock (Olivier) - Rename of the "copy buffers" API that got added in the 6.12 merge window. There's really no copying going on, it's just referencing the buffers. After a bit of consideration, decided that it was better to simply rename this to avoid potential confusion (me) - Shrink struct io_mapped_ubuf from 48 to 32 bytes, by changing it to start + len tracking rather than having start / end in there, and by removing the caching of folio_mask when we can just calculate it from folio_shift when we need it (me) - Fixes for the SQPOLL affinity checking (me, Felix) - Fix for how cqring waiting checks for the presence of task_work. Just check it directly rather than check for a specific notification mechanism (me) - Tweak to how request linking is represented in tracing (me) - Fix a syzbot report that deliberately sets up a huge list of overflow entries, and then hits rcu stalls when flushing this list. Just check for the need to preempt, and drop/reacquire locks in the loop. There's no state maintained over the loop itself, and each entry is yanked from head-of-list (me)" * tag 'for-6.12/io_uring-20240922' of git://git.kernel.dk/linux: io_uring: check if we need to reschedule during overflow flush io_uring: improve request linking trace io_uring: check for presence of task_work rather than TIF_NOTIFY_SIGNAL io_uring/sqpoll: do the napi busy poll outside the submission block io_uring: clean up a type in io_uring_register_get_file() io_uring/sqpoll: do not put cpumask on stack io_uring/sqpoll: retain test for whether the CPU is valid io_uring/rsrc: change ubuf->ubuf_end to length tracking io_uring/rsrc: get rid of io_mapped_ubuf->folio_mask io_uring: rename "copy buffers" to "clone buffers"
2024-09-20io_uring: check if we need to reschedule during overflow flushJens Axboe1-0/+15
In terms of normal application usage, this list will always be empty. And if an application does overflow a bit, it'll have a few entries. However, nothing obviously prevents syzbot from running a test case that generates a ton of overflow entries, and then flushing them can take quite a while. Check for needing to reschedule while flushing, and drop our locks and do so if necessary. There's no state to maintain here as overflows always prune from head-of-list, hence it's fine to drop and reacquire the locks at the end of the loop. Link: https://lore.kernel.org/io-uring/66ed061d.050a0220.29194.0053.GAE@google.com/ Reported-by: syzbot+5fca234bd7eb378ff78e@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-20io_uring: improve request linking traceJens Axboe1-1/+1
Right now any link trace is listed as being linked after the head request in the chain, but it's more useful to note explicitly which request a given new request is chained to. Change the link trace to dump the tail request so that chains are immediately apparent when looking at traces. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-19io_uring: check for presence of task_work rather than TIF_NOTIFY_SIGNALJens Axboe1-2/+2
If some part of the kernel adds task_work that needs executing, in terms of signaling it'll generally use TWA_SIGNAL or TWA_RESUME. Those two directly translate to TIF_NOTIFY_SIGNAL or TIF_NOTIFY_RESUME, and can be used for a variety of use case outside of task_work. However, io_cqring_wait_schedule() only tests explicitly for TIF_NOTIFY_SIGNAL. This means it can miss if task_work got added for the task, but used a different kind of signaling mechanism (or none at all). Normally this doesn't matter as any task_work will be run once the task exits to userspace, except if: 1) The ring is setup with DEFER_TASKRUN 2) The local work item may generate normal task_work For condition 2, this can happen when closing a file and it's the final put of that file, for example. This can cause stalls where a task is waiting to make progress inside io_cqring_wait(), but there's nothing else that will wake it up. Hence change the "should we schedule or loop around" check to check for the presence of task_work explicitly, rather than just TIF_NOTIFY_SIGNAL as the mechanism. While in there, also change the ordering of what type of task_work first in terms of ordering, to both make it consistent with other task_work runs in io_uring, but also to better handle the case of defer task_work generating normal task_work, like in the above example. Reported-by: Jan Hendrik Farr <kernel@jfarr.cc> Link: https://github.com/axboe/liburing/issues/1235 Cc: stable@vger.kernel.org Fixes: 846072f16eed ("io_uring: mimimise io_cqring_wait_schedule") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-18Merge tag 'slab-for-6.12' of ↵Linus Torvalds1-6/+8
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab updates from Vlastimil Babka: "This time it's mostly refactoring and improving APIs for slab users in the kernel, along with some debugging improvements. - kmem_cache_create() refactoring (Christian Brauner) Over the years have been growing new parameters to kmem_cache_create() where most of them are needed only for a small number of caches - most recently the rcu_freeptr_offset parameter. To avoid adding new parameters to kmem_cache_create() and adjusting all its callers, or creating new wrappers such as kmem_cache_create_rcu(), we can now pass extra parameters using the new struct kmem_cache_args. Not explicitly initialized fields default to values interpreted as unused. kmem_cache_create() is for now a wrapper that works both with the new form: kmem_cache_create(name, object_size, args, flags) and the legacy form: kmem_cache_create(name, object_size, align, flags, ctor) - kmem_cache_destroy() waits for kfree_rcu()'s in flight (Vlastimil Babka, Uladislau Rezki) Since SLOB removal, kfree() is allowed for freeing objects allocated by kmem_cache_create(). By extension kfree_rcu() as allowed as well, which can allow converting simple call_rcu() callbacks that only do kmem_cache_free(), as there was never a kmem_cache_free_rcu() variant. However, for caches that can be destroyed e.g. on module removal, the cache owners knew to issue rcu_barrier() first to wait for the pending call_rcu()'s, and this is not sufficient for pending kfree_rcu()'s due to its internal batching optimizations. Ulad has provided a new kvfree_rcu_barrier() and to make the usage less error-prone, kmem_cache_destroy() calls it. Additionally, destroying SLAB_TYPESAFE_BY_RCU caches now again issues rcu_barrier() synchronously instead of using an async work, because the past motivation for async work no longer applies. Users of custom call_rcu() callbacks should however keep calling rcu_barrier() before cache destruction. - Debugging use-after-free in SLAB_TYPESAFE_BY_RCU caches (Jann Horn) Currently, KASAN cannot catch UAFs in such caches as it is legal to access them within a grace period, and we only track the grace period when trying to free the underlying slab page. The new CONFIG_SLUB_RCU_DEBUG option changes the freeing of individual object to be RCU-delayed, after which KASAN can poison them. - Delayed memcg charging (Shakeel Butt) In some cases, the memcg is uknown at allocation time, such as receiving network packets in softirq context. With kmem_cache_charge() these may be now charged later when the user and its memcg is known. - Misc fixes and improvements (Pedro Falcato, Axel Rasmussen, Christoph Lameter, Yan Zhen, Peng Fan, Xavier)" * tag 'slab-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (34 commits) mm, slab: restore kerneldoc for kmem_cache_create() io_uring: port to struct kmem_cache_args slab: make __kmem_cache_create() static inline slab: make kmem_cache_create_usercopy() static inline slab: remove kmem_cache_create_rcu() file: port to struct kmem_cache_args slab: create kmem_cache_create() compatibility layer slab: port KMEM_CACHE_USERCOPY() to struct kmem_cache_args slab: port KMEM_CACHE() to struct kmem_cache_args slab: remove rcu_freeptr_offset from struct kmem_cache slab: pass struct kmem_cache_args to do_kmem_cache_create() slab: pull kmem_cache_open() into do_kmem_cache_create() slab: pass struct kmem_cache_args to create_cache() slab: port kmem_cache_create_usercopy() to struct kmem_cache_args slab: port kmem_cache_create_rcu() to struct kmem_cache_args slab: port kmem_cache_create() to struct kmem_cache_args slab: add struct kmem_cache_args slab: s/__kmem_cache_create/do_kmem_cache_create/g memcg: add charging of already allocated slab objects mm/slab: Optimize the code logic in find_mergeable() ...
2024-09-11io_uring/cmd: expose iowq to cmdsPavel Begunkov1-0/+11
When an io_uring request needs blocking context we offload it to the io_uring's thread pool called io-wq. We can get there off ->uring_cmd by returning -EAGAIN, but there is no straightforward way of doing that from an asynchronous callback. Add a helper that would transfer a command to a blocking context. Note, we do an extra hop via task_work before io_queue_iowq(), that's a limitation of io_uring infra we have that can likely be lifted later if that would ever become a problem. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f735f807d7c8ba50c9452c69dfe5d3e9e535037b.1726072086.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-10io_uring: port to struct kmem_cache_argsChristian Brauner1-6/+8
Port req_cachep to struct kmem_cache_args. Reviewed-by: Kees Cook <kees@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-08-29io_uring/kbuf: pass in 'len' argument for buffer commitJens Axboe1-1/+1
In preparation for needing the consumed length, pass in the length being completed. Unused right now, but will be used when it is possible to partially consume a buffer. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25io_uring: wire up min batch wake timeoutJens Axboe1-4/+4
Expose min_wait_usec in io_uring_getevents_arg, replacing the pad member that is currently in there. The value is in usecs, which is explained in the name as well. Note that if min_wait_usec and a normal timeout is used in conjunction, the normal timeout is still relative to the base time. For example, if min_wait_usec is set to 100 and the normal timeout is 1000, the max total time waited is still 1000. This also means that if the normal timeout is shorter than min_wait_usec, then only the min_wait_usec will take effect. See previous commit for an explanation of how this works. IORING_FEAT_MIN_TIMEOUT is added as a feature flag for this, as applications doing submit_and_wait_timeout() style operations will generally not see the -EINVAL from the wait side as they return the number of IOs submitted. Only if no IOs are submitted will the -EINVAL bubble back up to the application. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25io_uring: add support for batch wait timeoutJens Axboe1-13/+80
Waiting for events with io_uring has two knobs that can be set: 1) The number of events to wake for 2) The timeout associated with the event Waiting will abort when either of those conditions are met, as expected. This adds support for a third event, which is associated with the number of events to wait for. Applications generally like to handle batches of completions, and right now they'd set a number of events to wait for and the timeout for that. If no events have been received but the timeout triggers, control is returned to the application and it can wait again. However, if the application doesn't have anything to do until events are reaped, then it's possible to make this waiting more efficient. For example, the application may have a latency time of 50 usecs and wanting to handle a batch of 8 requests at the time. If it uses 50 usecs as the timeout, then it'll be doing 20K context switches per second even if nothing is happening. This introduces the notion of min batch wait time. If the min batch wait time expires, then we'll return to userspace if we have any events at all. If none are available, the general wait time is applied. Any request arriving after the min batch wait time will cause waiting to stop and return control to the application. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25io_uring: implement our own schedule timeout handlingJens Axboe1-4/+31
In preparation for having two distinct timeouts and avoid waking the task if we don't need to. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25io_uring: move schedule wait logic into helperJens Axboe1-16/+21
In preparation for expanding how we handle waits, move the actual schedule and schedule_timeout() handling into a helper. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25io_uring: encapsulate extraneous wait flags into a separate structJens Axboe1-21/+24
Rather than need to pass in 2 or 3 separate arguments, add a struct to encapsulate the timeout and sigset_t parts of waiting. In preparation for adding another argument for waiting. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25io_uring: user registered clockid for wait timeoutsPavel Begunkov1-2/+6
Add a new registration opcode IORING_REGISTER_CLOCK, which allows the user to select which clock id it wants to use with CQ waiting timeouts. It only allows a subset of all posix clocks and currently supports CLOCK_MONOTONIC and CLOCK_BOOTTIME. Suggested-by: Lewis Baker <lewissbaker@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/98f2bc8a3c36cdf8f0e6a275245e81e903459703.1723039801.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25io_uring: add absolute mode wait timeoutsPavel Begunkov1-7/+8
In addition to current relative timeouts for the waiting loop, where the timespec argument specifies the maximum time it can wait for, add support for the absolute mode, with the value carrying a CLOCK_MONOTONIC absolute time until which we should return control back to the user. Suggested-by: Lewis Baker <lewissbaker@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4d5b74d67ada882590b2e42aa3aa7117bbf6b55f.1723039801.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25io_uring/napi: postpone napi timeout adjustmentPavel Begunkov1-1/+0
Remove io_napi_adjust_timeout() and move the adjustments out of the common path into __io_napi_busy_loop(). Now the limit it's calculated based on struct io_wait_queue::timeout, for which we query current time another time. The overhead shouldn't be a problem, it's a polling path, however that can be optimised later by additionally saving the delta time value in io_cqring_wait(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/88e14686e245b3b42ff90a3c4d70895d48676206.1723039801.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-26io_uring/napi: pass ktime to io_napi_adjust_timeoutPavel Begunkov1-2/+4
Pass the waiting time for __io_napi_adjust_timeout as ktime and get rid of all timespec64 conversions. It's especially simpler since the caller already have a ktime. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4f5b8e8eed4f53a1879e031a6712b25381adc23d.1722003776.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-24io_uring: align iowq and task request error handlingPavel Begunkov1-1/+1
There is a difference in how io_queue_sqe and io_wq_submit_work treat error codes they get from io_issue_sqe. The first one fails anything unknown but latter only fails when the code is negative. It doesn't make sense to have this discrepancy, align them to the io_queue_sqe behaviour. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c550e152bf4a290187f91a4322ddcb5d6d1f2c73.1721819383.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-24io_uring: tighten task exit cancellationsPavel Begunkov1-1/+4
io_uring_cancel_generic() should retry if any state changes like a request is completed, however in case of a task exit it only goes for another loop and avoids schedule() if any tracked (i.e. REQ_F_INFLIGHT) request got completed. Let's assume we have a non-tracked request executing in iowq and a tracked request linked to it. Let's also assume io_uring_cancel_generic() fails to find and cancel the request, i.e. via io_run_local_work(), which may happen as io-wq has gaps. Next, the request logically completes, io-wq still hold a ref but queues it for completion via tw, which happens in io_uring_try_cancel_requests(). After, right before prepare_to_wait() io-wq puts the request, grabs the linked one and tries executes it, e.g. arms polling. Finally the cancellation loop calls prepare_to_wait(), there are no tw to run, no tracked request was completed, so the tctx_inflight() check passes and the task is put to indefinite sleep. Cc: stable@vger.kernel.org Fixes: 3f48cf18f886c ("io_uring: unify files and task cancel") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/acac7311f4e02ce3c43293f8f1fda9c705d158f1.1721819383.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-15Merge tag 'for-6.11/io_uring-20240714' of git://git.kernel.dk/linuxLinus Torvalds1-95/+55
Pull io_uring updates from Jens Axboe: "Here are the io_uring updates queued up for 6.11. Nothing major this time around, various minor improvements and cleanups/fixes. This contains: - Add bind/listen opcodes. Main motivation is to support direct descriptors, to avoid needing a regular fd just for doing these two operations (Gabriel) - Probe fixes (Gabriel) - Treat io-wq work flags as atomics. Not fixing a real issue, but may as well and it silences a KCSAN warning (me) - Cleanup of rsrc __set_current_state() usage (me) - Add 64-bit for {m,f}advise operations (me) - Improve performance of data ring messages (me) - Fix for ring message overflow posting (Pavel) - Fix for freezer interaction with TWA_NOTIFY_SIGNAL. Not strictly an io_uring thing, but since TWA_NOTIFY_SIGNAL was originally added for faster task_work signaling for io_uring, bundling it with this pull (Pavel) - Add Pavel as a co-maintainer - Various cleanups (me, Thorsten)" * tag 'for-6.11/io_uring-20240714' of git://git.kernel.dk/linux: (28 commits) io_uring/net: check socket is valid in io_bind()/io_listen() kernel: rerun task_work while freezing in get_signal() io_uring/io-wq: limit retrying worker initialisation io_uring/napi: Remove unnecessary s64 cast io_uring/net: cleanup io_recv_finish() bundle handling io_uring/msg_ring: fix overflow posting MAINTAINERS: change Pavel Begunkov from io_uring reviewer to maintainer io_uring/msg_ring: use kmem_cache_free() to free request io_uring/msg_ring: check for dead submitter task io_uring/msg_ring: add an alloc cache for io_kiocb entries io_uring/msg_ring: improve handling of target CQE posting io_uring: add io_add_aux_cqe() helper io_uring: add remote task_work execution helper io_uring/msg_ring: tighten requirement for remote posting io_uring: Allocate only necessary memory in io_probe io_uring: Fix probe of disabled operations io_uring: Introduce IORING_OP_LISTEN io_uring: Introduce IORING_OP_BIND net: Split a __sys_listen helper for io_uring net: Split a __sys_bind helper for io_uring ...
2024-07-02io_uring/msg_ring: fix overflow postingPavel Begunkov1-1/+5
The caller of io_cqring_event_overflow() should be holding the completion_lock, which is violated by io_msg_tw_complete. There is only one caller of io_add_aux_cqe(), so just add locking there for now. WARNING: CPU: 0 PID: 5145 at io_uring/io_uring.c:703 io_cqring_event_overflow+0x442/0x660 io_uring/io_uring.c:703 RIP: 0010:io_cqring_event_overflow+0x442/0x660 io_uring/io_uring.c:703 <TASK> __io_post_aux_cqe io_uring/io_uring.c:816 [inline] io_add_aux_cqe+0x27c/0x320 io_uring/io_uring.c:837 io_msg_tw_complete+0x9d/0x4d0 io_uring/msg_ring.c:78 io_fallback_req_func+0xce/0x1c0 io_uring/io_uring.c:256 process_one_work kernel/workqueue.c:3224 [inline] process_scheduled_works+0xa2c/0x1830 kernel/workqueue.c:3305 worker_thread+0x86d/0xd40 kernel/workqueue.c:3383 kthread+0x2f0/0x390 kernel/kthread.c:389 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:144 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 </TASK> Fixes: f33096a3c99c0 ("io_uring: add io_add_aux_cqe() helper") Reported-by: syzbot+f7f9c893345c5c615d34@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c7350d07fefe8cce32b50f57665edbb6355ea8c1.1719927398.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-24io_uring: signal SQPOLL task_work with TWA_SIGNAL_NO_IPIJens Axboe1-2/+2
Before SQPOLL was transitioned to managing its own task_work, the core used TWA_SIGNAL_NO_IPI to ensure that task_work was processed. If not, we can't be sure that all task_work is processed at SQPOLL thread exit time. Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately") Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-24io_uring/msg_ring: add an alloc cache for io_kiocb entriesJens Axboe1-0/+6
With slab accounting, allocating and freeing memory has considerable overhead. Add a basic alloc cache for the io_kiocb allocations that msg_ring needs to do. Unlike other caches, this one is used by the sender, grabbing it from the remote ring. When the remote ring gets the posted completion, it'll free it locally. Hence it is separately locked, using ctx->msg_lock. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-24io_uring: add io_add_aux_cqe() helperJens Axboe1-2/+21
This helper will post a CQE, and can be called from task_work where we now that the ctx is already properly locked and that deferred completions will get flushed later on. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-24io_uring: add remote task_work execution helperJens Axboe1-8/+16
All our task_work handling is targeted at the state in the io_kiocb itself, which is what it is being used for. However, MSG_RING rolls its own task_work handling, ignoring how that is usually done. In preparation for switching MSG_RING to be able to use the normal task_work handling, add io_req_task_work_add_remote() which allows the caller to pass in the target io_ring_ctx. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-16io_uring/io-wq: make io_wq_work flags atomicJens Axboe1-6/+6
The work flags can be set/accessed from different tasks, both the originator of the request, and the io-wq workers. While modifications aren't concurrent, it still makes KMSAN unhappy. There's no real downside to just making the flag reading/manipulation use proper atomics here. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-16io_uring: use 'state' consistentlyJens Axboe1-1/+1
__io_submit_flush_completions() assigns ctx->submit_state to a local variable and uses it in all but one spot, switch that forgotten statement to using 'state' as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-16io_uring/eventfd: move eventfd handling to separate fileJens Axboe1-81/+1
This is pretty nicely abstracted already, but let's move it to a separate file rather than have it in the main io_uring file. With that, we can also move the io_ev_fd struct and enum out of global scope. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-16io_uring/eventfd: move to more idiomatic RCU free usageJens Axboe1-23/+26
In some ways, it just "happens to work" currently with using the ops field for both the free and signaling bit. But it depends on ordering of operations in terms of freeing and signaling. Clean it up and use the usual refs == 0 under RCU read side lock to determine if the ev_fd is still valid, and use the reference to gate the freeing as well. Fixes: 21a091b970cd ("io_uring: signal registered eventfd to process deferred task work") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-13io_uring: fix cancellation overwriting req->flagsPavel Begunkov1-0/+1
Only the current owner of a request is allowed to write into req->flags. Hence, the cancellation path should never touch it. Add a new field instead of the flag, move it into the 3rd cache line because it should always be initialised. poll_refs can move further as polling is an involved process anyway. It's a minimal patch, in the future we can and should find a better place for it and remove now unused REQ_F_CANCEL_SEQ. Fixes: 521223d7c229f ("io_uring/cancel: don't default to setting req->work.cancel_seq") Cc: stable@vger.kernel.org Reported-by: Li Shi <sl1589472800@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6827b129f8f0ad76fa9d1f0a773de938b240ffab.1718323430.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-22io_uring: remove checks for NULL 'sq_offset'Jens Axboe1-4/+2
Since the 5.12 kernel release, nobody has been passing NULL as the sq_offset pointer. Remove the checks for it being NULL or not, it will always be valid. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-13Merge tag 'for-6.10/io_uring-20240511' of git://git.kernel.dk/linuxLinus Torvalds1-544/+121
Pull io_uring updates from Jens Axboe: - Greatly improve send zerocopy performance, by enabling coalescing of sent buffers. MSG_ZEROCOPY already does this with send(2) and sendmsg(2), but the io_uring side did not. In local testing, the crossover point for send zerocopy being faster is now around 3000 byte packets, and it performs better than the sync syscall variants as well. This feature relies on a shared branch with net-next, which was pulled into both branches. - Unification of how async preparation is done across opcodes. Previously, opcodes that required extra memory for async retry would allocate that as needed, using on-stack state until that was the case. If async retry was needed, the on-stack state was adjusted appropriately for a retry and then copied to the allocated memory. This led to some fragile and ugly code, particularly for read/write handling, and made storage retries more difficult than they needed to be. Allocate the memory upfront, as it's cheap from our pools, and use that state consistently both initially and also from the retry side. - Move away from using remap_pfn_range() for mapping the rings. This is really not the right interface to use and can cause lifetime issues or leaks. Additionally, it means the ring sq/cq arrays need to be physically contigious, which can cause problems in production with larger rings when services are restarted, as memory can be very fragmented at that point. Move to using vm_insert_page(s) for the ring sq/cq arrays, and apply the same treatment to mapped ring provided buffers. This also helps unify the code we have dealing with allocating and mapping memory. Hard to see in the diffstat as we're adding a few features as well, but this kills about ~400 lines of code from the codebase as well. - Add support for bundles for send/recv. When used with provided buffers, bundles support sending or receiving more than one buffer at the time, improving the efficiency by only needing to call into the networking stack once for multiple sends or receives. - Tweaks for our accept operations, supporting both a DONTWAIT flag for skipping poll arm and retry if we can, and a POLLFIRST flag that the application can use to skip the initial accept attempt and rely purely on poll for triggering the operation. Both of these have identical flags on the receive side already. - Make the task_work ctx locking unconditional. We had various code paths here that would do a mix of lock/trylock and set the task_work state to whether or not it was locked. All of that goes away, we lock it unconditionally and get rid of the state flag indicating whether it's locked or not. The state struct still exists as an empty type, can go away in the future. - Add support for specifying NOP completion values, allowing it to be used for error handling testing. - Use set/test bit for io-wq worker flags. Not strictly needed, but also doesn't hurt and helps silence a KCSAN warning. - Cleanups for io-wq locking and work assignments, closing a tiny race where cancelations would not be able to find the work item reliably. - Misc fixes, cleanups, and improvements * tag 'for-6.10/io_uring-20240511' of git://git.kernel.dk/linux: (97 commits) io_uring: support to inject result for NOP io_uring: fail NOP if non-zero op flags is passed in io_uring/net: add IORING_ACCEPT_POLL_FIRST flag io_uring/net: add IORING_ACCEPT_DONTWAIT flag io_uring/filetable: don't unnecessarily clear/reset bitmap io_uring/io-wq: Use set_bit() and test_bit() at worker->flags io_uring/msg_ring: cleanup posting to IOPOLL vs !IOPOLL ring io_uring: Require zeroed sqe->len on provided-buffers send io_uring/notif: disable LAZY_WAKE for linked notifs io_uring/net: fix sendzc lazy wake polling io_uring/msg_ring: reuse ctx->submitter_task read using READ_ONCE instead of re-reading it io_uring/rw: reinstate thread check for retries io_uring/notif: implement notification stacking io_uring/notif: simplify io_notif_flush() net: add callback for setting a ubuf_info to skb net: extend ubuf_info callback to ops structure io_uring/net: support bundles for recv io_uring/net: support bundles for send io_uring/kbuf: add helpers for getting/peeking multiple buffers io_uring/net: add provided buffer support for IORING_OP_SEND ...
2024-05-13Merge tag 'vfs-6.10.misc' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual fses. Features: - Free up FMODE_* bits. I've freed up bits 6, 7, 8, and 24. That means we now have six free FMODE_* bits in total (but bit #6 already got used for FMODE_WRITE_RESTRICTED) - Add FOP_HUGE_PAGES flag (follow-up to FMODE_* cleanup) - Add fd_raw cleanup class so we can make use of automatic cleanup provided by CLASS(fd_raw, f)(fd) for O_PATH fds as well - Optimize seq_puts() - Simplify __seq_puts() - Add new anon_inode_getfile_fmode() api to allow specifying f_mode instead of open-coding it in multiple places - Annotate struct file_handle with __counted_by() and use struct_size() - Warn in get_file() whether f_count resurrection from zero is attempted (epoll/drm discussion) - Folio-sophize aio - Export the subvolume id in statx() for both btrfs and bcachefs - Relax linkat(AT_EMPTY_PATH) requirements - Add F_DUPFD_QUERY fcntl() allowing to compare two file descriptors for dup*() equality replacing kcmp() Cleanups: - Compile out swapfile inode checks when swap isn't enabled - Use (1 << n) notation for FMODE_* bitshifts for clarity - Remove redundant variable assignment in fs/direct-io - Cleanup uses of strncpy in orangefs - Speed up and cleanup writeback - Move fsparam_string_empty() helper into header since it's currently open-coded in multiple places - Add kernel-doc comments to proc_create_net_data_write() - Don't needlessly read dentry->d_flags twice Fixes: - Fix out-of-range warning in nilfs2 - Fix ecryptfs overflow due to wrong encryption packet size calculation - Fix overly long line in xfs file_operations (follow-up to FMODE_* cleanup) - Don't raise FOP_BUFFER_{R,W}ASYNC for directories in xfs (follow-up to FMODE_* cleanup) - Don't call xfs_file_open from xfs_dir_open (follow-up to FMODE_* cleanup) - Fix stable offset api to prevent endless loops - Fix afs file server rotations - Prevent xattr node from overflowing the eraseblock in jffs2 - Move fdinfo PTRACE_MODE_READ procfs check into the .permission() operation instead of .open() operation since this caused userspace regressions" * tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits) afs: Fix fileserver rotation getting stuck selftests: add F_DUPDFD_QUERY selftests fcntl: add F_DUPFD_QUERY fcntl() file: add fd_raw cleanup class fs: WARN when f_count resurrection is attempted seq_file: Simplify __seq_puts() seq_file: Optimize seq_puts() proc: Move fdinfo PTRACE_MODE_READ check into the inode .permission operation fs: Create anon_inode_getfile_fmode() xfs: don't call xfs_file_open from xfs_dir_open xfs: drop fop_flags for directories xfs: fix overly long line in the file_operations shmem: Fix shmem_rename2() libfs: Add simple_offset_rename() API libfs: Fix simple_offset_rename_exchange() jffs2: prevent xattr node from overflowing the eraseblock vfs, swap: compile out IS_SWAPFILE() on swapless configs vfs: relax linkat() AT_EMPTY_PATH - aka flink() - requirements fs/direct-io: remove redundant assignment to variable retval fs/dcache: Re-use value stored to dentry->d_flags instead of re-reading ...
2024-04-25io_uring/rw: reinstate thread check for retriesJens Axboe1-13/+0
Allowing retries for everything is arguably the right thing to do, now that every command type is async read from the start. But it's exposed a few issues around missing check for a retry (which cca6571381a0 exposed), and the fixup commit for that isn't necessarily 100% sound in terms of iov_iter state. For now, just revert these two commits. This unfortunately then re-opens the fact that -EAGAIN can get bubbled to userspace for some cases where the kernel very well could just sanely retry them. But until we have all the conditions covered around that, we cannot safely enable that. This reverts commit df604d2ad480fcf7b39767280c9093e13b1de952. This reverts commit cca6571381a0bdc88021a1f7a4c2349df21279f7. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-22io_uring/net: support bundles for recvJens Axboe1-1/+2
If IORING_OP_RECV is used with provided buffers, the caller may also set IORING_RECVSEND_BUNDLE to turn it into a multi-buffer recv. This grabs buffers available and receives into them, posting a single completion for all of it. This can be used with multishot receive as well, or without it. Now that both send and receive support bundles, add a feature flag for it as well. If IORING_FEAT_RECVSEND_BUNDLE is set after registering the ring, then the kernel supports bundles for recv and send. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17io_uring/rw: ensure retry condition isn't lostJens Axboe1-0/+13
A previous commit removed the checking on whether or not it was possible to retry a request, since it's now possible to retry any of them. This would previously have caused the request to have been ended with an error, but now the retry condition can simply get lost instead. Cleanup the retry handling and always just punt it to task_work, which will queue it with io-wq appropriately. Reported-by: Changhui Zhong <czhong@redhat.com> Tested-by: Ming Lei <ming.lei@redhat.com> Fixes: cca6571381a0 ("io_uring/rw: cleanup retry path") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring: ensure overflow entries are dropped when ring is exitingJens Axboe1-1/+2
A previous consolidation cleanup missed handling the case where the ring is dying, and __io_cqring_overflow_flush() doesn't flush entries if the CQ ring is already full. This is fine for the normal CQE overflow flushing, but if the ring is going away, we need to flush everything, even if it means simply freeing the overflown entries. Fixes: 6c948ec44b29 ("io_uring: consolidate overflow flushing") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring: consolidate overflow flushingPavel Begunkov1-25/+15
Consolidate __io_cqring_overflow_flush and io_cqring_overflow_kill() into a single function as it once was, it's easier to work with it this way. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/986b42c35e76a6be7aa0cdcda0a236a2222da3a7.1712708261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring: always lock __io_cqring_overflow_flushPavel Begunkov1-5/+8
Conditional locking is never great, in case of __io_cqring_overflow_flush(), which is a slow path, it's not justified. Don't handle IOPOLL separately, always grab uring_lock for overflow flushing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/162947df299aa12693ac4b305dacedab32ec7976.1712708261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring: open code io_cqring_overflow_flush()Pavel Begunkov1-8/+3
There is only one caller of io_cqring_overflow_flush(), open code it Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a1fecd56d9dba923ed8d4d159727fa939d3baa2a.1712708261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring: remove extra SQPOLL overflow flushPavel Begunkov1-2/+0
c1edbf5f081be ("io_uring: flag SQPOLL busy condition to userspace") added an extra overflowed CQE flush in the SQPOLL submission path due to backpressure, which was later removed. Remove the flush and let io_cqring_wait() / iopoll handle it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2a83b0724ca6ca9d16c7d79a51b77c81876b2e39.1712708261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring: unexport io_req_cqe_overflow()Pavel Begunkov1-1/+1
There are no users of io_req_cqe_overflow() apart from io_uring.c, make it static. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f4295eb2f9eb98d5db38c0578f57f0b86bfe0d8c.1712708261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring: return void from io_put_kbuf_comp()Ming Lei1-1/+1
The only caller doesn't handle the return value of io_put_kbuf_comp(), so change its return type into void. Also follow Jens's suggestion to rename it as io_put_kbuf_drop(). Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240407132759.4056167-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring: remove io_req_put_rsrc_locked()Pavel Begunkov1-3/+2
io_req_put_rsrc_locked() is a weird shim function around io_req_put_rsrc(). All calls to io_req_put_rsrc() require holding ->uring_lock, so we can just use it directly. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a195bc78ac3d2c6fbaea72976e982fe51e50ecdd.1712331455.git.asml.silence@gmail.com Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring: remove async request cachePavel Begunkov1-22/+0
io_req_complete_post() was a sole user of ->locked_free_list, but since we just gutted the function, the cache is not used anymore and can be removed. ->locked_free_list served as an asynhronous counterpart of the main request (i.e. struct io_kiocb) cache for all unlocked cases like io-wq. Now they're all forced to be completed into the main cache directly, off of the normal completion path or via io_free_req(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7bffccd213e370abd4de480e739d8b08ab6c1326.1712331455.git.asml.silence@gmail.com Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring: turn implicit assumptions into a warningPavel Begunkov1-1/+11
io_req_complete_post() is now io-wq only and shouldn't be used outside of it, i.e. it relies that io-wq holds a ref for the request as explained in a comment below. Let's add a warning to enforce the assumption and make sure nobody would try to do anything weird. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1013b60c35d431d0698cafbc53c06f5917348c20.1712331455.git.asml.silence@gmail.com Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring: kill dead code in io_req_complete_postMing Lei1-35/+2
Since commit 8f6c829491fe ("io_uring: remove struct io_tw_state::locked"), io_req_complete_post() is only called from io-wq submit work, where the request reference is guaranteed to be grabbed and won't drop to zero in io_req_complete_post(). Kill the dead code, meantime add req_ref_put() to put the reference. Cc: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1d8297e2046553153e763a52574f0e0f4d512f86.1712331455.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring: move mapping/allocation helpers to a separate fileJens Axboe1-325/+2
Move the related code from io_uring.c into memmap.c. No functional changes in this patch, just cleaning it up a bit now that the full transition is done. Signed-off-by: Jens Axboe <axboe@kernel.dk>