2022-03-17  io_uring: manage provided buffers strictly ordered  Jens Axboe  1 file, -62/+92
Workloads using provided buffers benefit from using and returning buffers in the right order, and so do TLBs for that matter. Manage the internal buffer list as a straight list, rather than using the head buffer as the insertion node. Use a hashed list for the buffer group IDs instead of an xarray; the overhead is much lower this way. xarray provides internal locking and other trickery that is handy for some use cases, but io_uring already locks internally for the buffer manipulation and needs none of that. This is good for about a 2% reduction in overhead, a combination of the improved management and the fact that the workload has an easier time bundling back provided buffers. Signed-off-by: Jens Axboe <[email protected]>
2022-03-16  io_uring: fold evfd signalling under a slower path  Pavel Begunkov  1 file, -26/+34
Add a ->has_evfd flag, which is true iff there is an eventfd attached, and use it to hide io_eventfd_signal() inside __io_commit_cqring_flush() and combine the fast checks into a single if. Also, gcc 11.2 wasn't inlining io_cqring_ev_posted() without this change, so this helps with that as well. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/f6168471997decded475a063f92915787975a30b.1647481208.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2022-03-16  io_uring: thin down io_commit_cqring()  Pavel Begunkov  1 file, -8/+15
io_commit_cqring() is currently always called within a spinlock section, so it's always better to keep it as slim as possible. Move __io_commit_cqring_flush() out of it into ev_posted*(). If the fast checks do fail and this post-processing is required, we'll reacquire ->completion_lock, which is fine as we don't care about the performance of draining and offset timeouts. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/ec4e81fd720d3bc7bca8cb9152e080dad1a052f1.1647481208.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2022-03-16  io_uring: shuffle io_eventfd_signal() bits around  Pavel Begunkov  1 file, -8/+5
A preparation patch, which moves a fast ->io_ev_fd check out of io_eventfd_signal() into ev_posted*(). Compilers are smart enough for it to not change anything, but we will need it later. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/ec4091ac76d43912b73917e8db651c2dac4b7b01.1647481208.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2022-03-16  io_uring: remove extra barrier for non-sqpoll iopoll  Pavel Begunkov  1 file, -4/+1
smp_mb() in io_cqring_ev_posted_iopoll() is only there because of waitqueue_active(). However, non-SQPOLL IOPOLL ring doesn't wake the CQ and so the barrier there is useless. Kill it, it's usually pretty expensive. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/d72e8ef6f7a3f6a72e18fad8409f7d47afc8da7d.1647481208.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2022-03-16  io_uring: fix provided buffer return on failure for kiocb_done()  Pavel Begunkov  1 file, -7/+3
Use io_req_complete_failed() in kiocb_done(). This cleans up the code, but also ensures that a provided buffer is correctly freed on failure. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/a4880106fcf199d5810707fe2d17126fcdf18bc4.1647481208.git.asml.silence@gmail.com [axboe: split from previous patch] Signed-off-by: Jens Axboe <[email protected]>
2022-03-16  io_uring: extend provided buf return to fails  Pavel Begunkov  1 file, -1/+1
It's never a good idea to put provided buffers without notifying userspace, as it will lead to userspace leaks, so add io_put_kbuf() in io_req_complete_failed(). The fail helper is called by all sorts of requests, but it's still safe to do, as io_put_kbuf() will return 0 for all requests that don't support, and so don't expect, provided buffers. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/a4880106fcf199d5810707fe2d17126fcdf18bc4.1647481208.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2022-03-16  io_uring: refactor timeout cancellation cqe posting  Pavel Begunkov  1 file, -4/+1
io_fill_cqe*() is not always the best way to post CQEs, simply because there is enough infrastructure on top. Replace a raw call to a variant of it inside of io_timeout_cancel(), which also saves us some bloat and might help with batching later. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/46113ec4345764b4aef3b384ce38cceabaeedcbb.1647481208.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2022-03-16  io_uring: normalise naming for fill_cqe*  Pavel Begunkov  1 file, -8/+8
Restore consistency in the __io_fill_cqe*-like helpers, always honouring the "io_" prefix and adding "req" when we're passing in a request. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/bd016ff5c1a4f74687828069d2619d8a65e0c6d7.1647481208.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2022-03-16  io_uring: cache poll/double-poll state with a request flag  Jens Axboe  1 file, -5/+19
With commit "io_uring: cache req->apoll->events in req->cflags" applied, we now have just io_poll_remove_entries() dipping into req->apoll when it isn't strictly necessary. Mark poll and double-poll with a flag, so we know if we need to look at apoll->double_poll. This avoids pulling in those cachelines if we don't need them. The common case is that the poll wake handler already removed these entries while hot off the completion path. Signed-off-by: Jens Axboe <[email protected]>
2022-03-16  io_uring: cache req->apoll->events in req->cflags  Jens Axboe  1 file, -12/+19
When we arm poll on behalf of a different type of request, like a network receive, then we allocate req->apoll as our poll entry. Running network workloads shows io_poll_check_events() as the most expensive part of io_uring, and it's all due to having to pull in req->apoll instead of just the request which we have hot already. Cache poll->events in req->cflags, which isn't used until the request completes anyway. This isn't strictly needed for regular poll, where req->poll.events is used and thus already hot, but for the sake of unification we do it all around. This saves 3-4% of overhead in certain request workloads. Signed-off-by: Jens Axboe <[email protected]>
2022-03-16  io_uring: move req->poll_refs into previous struct hole  Jens Axboe  1 file, -3/+3
This serves two purposes:
- We now have the last cacheline mostly unused for generic workloads, instead of having to pull in the poll refs explicitly for workloads that rely on poll arming.
- It shrinks the io_kiocb from 232 to 224 bytes.
Signed-off-by: Jens Axboe <[email protected]>
2022-03-16  io_uring: make tracing format consistent  Dylan Yudaken  1 file, -12/+12
Make the tracing formatting for user_data and flags consistent. Having consistent formatting allows one for example to grep for a specific user_data/flags and be able to trace a single sqe through easily. Change user_data to 0x%llx and flags to 0x%x everywhere. The '0x' is useful to disambiguate for example "user_data 100". Additionally remove the '=' for flags in io_uring_req_failed, again for consistency. Signed-off-by: Dylan Yudaken <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-03-15  io_uring: recycle apoll_poll entries  Jens Axboe  1 file, -6/+37
Particularly for networked workloads, io_uring intensively uses its poll based backend to get a notification when data/space is available. Profiling workloads, we see 3-4% of alloc+free that is directly attributed to just the apoll allocation and free (and the rest being skb alloc+free). For the fast path, we have ctx->uring_lock held already for both issue and the inline completions, and we can utilize that to avoid any extra locking needed to have a basic recycling cache for the apoll entries on both the alloc and free side. Double poll still requires an allocation. But those are rare and not a fast path item. With the simple cache in place, we see a 3-4% reduction in overhead for the workload. Signed-off-by: Jens Axboe <[email protected]>
2022-03-12  io_uring: remove duplicated member check for io_msg_ring_prep()  Jens Axboe  1 file, -3/+2
Julia and the kernel test robot report that the prep handling for this command inadvertently checks one field twice:
fs/io_uring.c:4338:42-56: duplicated argument to && or ||
Get rid of it. Reported-by: kernel test robot <[email protected]> Reported-by: Julia Lawall <[email protected]> Fixes: 4f57f06ce218 ("io_uring: add support for IORING_OP_MSG_RING command") Signed-off-by: Jens Axboe <[email protected]>
2022-03-10  io_uring: allow submissions to continue on error  Jens Axboe  2 files, -3/+10
By default, io_uring will stop submitting a batch of requests if we run into an error submitting a request. This isn't strictly necessary, as the error result is passed out-of-band via a CQE anyway. And it can be a bit confusing for some applications. Provide a way to setup a ring that will continue submitting on error, when the error CQE has been posted. There's still one case that will break out of submission. If we fail allocating a request, then we'll still return -ENOMEM. We could in theory post a CQE for that condition too even if we never got a request. Leave that for a potential followup. Reported-by: Dylan Yudaken <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
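As a rough illustration of the behaviour described above (not part of this patch): the opt-in is a ring setup flag, assumed here to be named IORING_SETUP_SUBMIT_ALL, passed at queue init via liburing. A minimal sketch:

#include <liburing.h>
#include <stdio.h>

int main(void)
{
        struct io_uring ring;
        int ret;

        /*
         * Without this setup flag, a bad SQE stops submission of the rest
         * of the batch; with it, the error is still reported via a CQE but
         * the remaining SQEs are submitted. Flag name assumed from the
         * description above.
         */
        ret = io_uring_queue_init(8, &ring, IORING_SETUP_SUBMIT_ALL);
        if (ret < 0) {
                fprintf(stderr, "queue_init: %d\n", ret);
                return 1;
        }
        io_uring_queue_exit(&ring);
        return 0;
}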
2022-03-10  io_uring: recycle provided buffers if request goes async  Jens Axboe  1 file, -0/+36
If we are using provided buffers, it's less than useful to have a buffer selected and pinned if a request needs to go async, or arms poll to be notified of when we can process it. Recycle the buffer in those events, so we don't pin it for the duration of the request. Signed-off-by: Jens Axboe <[email protected]>
2022-03-10  io_uring: ensure reads re-import for selected buffers  Jens Axboe  1 file, -0/+10
If we drop buffers between scheduling a retry, then we need to re-import when we start the request again. Signed-off-by: Jens Axboe <[email protected]>
2022-03-10  io_uring: retry early for reads if we can poll  Jens Axboe  1 file, -0/+3
Most of the logic in io_read() deals with regular files, and in some ways it would make sense to split the handling into S_IFREG and others. But at least for retry, we don't need to bother setting up a bunch of state just to abort in the loop later. In particular, don't bother forcing setup of async data for a normal non-vectored read when we don't need it. Signed-off-by: Jens Axboe <[email protected]>
2022-03-10  io_uring: Add support for napi_busy_poll  Olivier Langlois  1 file, -1/+231
The sqpoll thread can be used for performing the napi busy poll in a similar way that it does io polling for file systems supporting direct access bypassing the page cache. The other way that io_uring can be used for napi busy poll is by calling io_uring_enter() to get events. If the user specifies a timeout value, it is distributed between polling and sleeping by using the systemwide setting /proc/sys/net/core/busy_poll. The changes have been tested with this program: https://github.com/lano1106/io_uring_udp_ping and the result is:
Without sqpoll:
NAPI busy loop disabled: rtt min/avg/max/mdev = 40.631/42.050/58.667/1.547 us
NAPI busy loop enabled:  rtt min/avg/max/mdev = 30.619/31.753/61.433/1.456 us
With sqpoll:
NAPI busy loop disabled: rtt min/avg/max/mdev = 42.087/44.438/59.508/1.533 us
NAPI busy loop enabled:  rtt min/avg/max/mdev = 35.779/37.347/52.201/0.924 us
Co-developed-by: Hao Xu <[email protected]> Signed-off-by: Hao Xu <[email protected]> Signed-off-by: Olivier Langlois <[email protected]> Link: https://lore.kernel.org/r/810bd9408ffc510ff08269e78dca9df4af0b9e4e.1646777484.git.olivier@trillion01.com Signed-off-by: Jens Axboe <[email protected]>
2022-03-10  io_uring: minor io_cqring_wait() optimization  Olivier Langlois  1 file, -8/+8
Move the block manipulating the sig variable up, so that code which may encounter an error and exit early runs first, before the rest of the function executes; this avoids useless computations. Signed-off-by: Olivier Langlois <[email protected]> Link: https://lore.kernel.org/r/84513f7cc1b1fb31d8f4cb910aee033391d036b4.1646777484.git.olivier@trillion01.com Signed-off-by: Jens Axboe <[email protected]>
2022-03-10  io_uring: add support for IORING_OP_MSG_RING command  Jens Axboe  2 files, -0/+58
This adds support for IORING_OP_MSG_RING, which allows an SQE to signal another ring. That allows either waking up someone waiting on the ring, or even passing a 64-bit value via the user_data field in the CQE. sqe->fd must contain the fd of a ring that should receive the CQE. sqe->off will be propagated to cqe->user_data on the target ring, and sqe->len will be propagated to cqe->res. The resulting CQE will have IORING_CQE_F_MSG set in its flags, to indicate that this CQE was generated from a messaging request rather than an SQE issued locally on that ring. This effectively allows passing a 64-bit and a 32-bit quantity between the two rings. This request type has the following request specific error cases:
- -EBADFD. Set if sqe->fd doesn't point to a file descriptor that is of the io_uring type.
- -EOVERFLOW. Set if we were not able to deliver a request to the target ring.
Signed-off-by: Jens Axboe <[email protected]>
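A minimal sketch of the userspace side, filling the raw SQE fields exactly as mapped above (target_ring_fd, my_data and my_res are caller-chosen example values; no liburing prep helper is assumed):

#include <liburing.h>
#include <string.h>

static void prep_msg_ring(struct io_uring_sqe *sqe, int target_ring_fd,
                          __u64 my_data, __u32 my_res)
{
        memset(sqe, 0, sizeof(*sqe));
        sqe->opcode = IORING_OP_MSG_RING;
        sqe->fd  = target_ring_fd;  /* must be an io_uring fd, else -EBADFD */
        sqe->off = my_data;         /* becomes cqe->user_data on the target ring */
        sqe->len = my_res;          /* becomes cqe->res on the target ring */
}

On the target ring, the posted CQE then carries IORING_CQE_F_MSG in cqe->flags, as described above.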
2022-03-10  io_uring: speedup provided buffer handling  Jens Axboe  1 file, -22/+117
In testing high frequency workloads with provided buffers, we spend a lot of time in allocating and freeing the buffer units themselves. Rather than repeatedly free and alloc them, add a recycling cache instead. There are two caches:
- ctx->io_buffers_cache. This is the one we grab from in the submission path, and it's protected by ctx->uring_lock. For inline completions, we can recycle straight back to this cache and not need any extra locking.
- ctx->io_buffers_comp. If we're not under uring_lock, then we use this list to recycle buffers. It's protected by the completion_lock.
On adding a new buffer, check io_buffers_cache. If it's empty, check if we can splice entries from the io_buffers_comp_cache. This reduces about 5-10% of overhead from provided buffers, bringing it pretty close to the non-provided path. Signed-off-by: Jens Axboe <[email protected]>
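For background on the userspace side of provided buffers (not part of this patch; a rough sketch using the existing liburing API, with the group ID and sizes as arbitrary example values): the application donates a pool of buffers to a buffer group and lets the kernel pick one per request.

#include <liburing.h>

#define BGID     1
#define NR_BUFS  8
#define BUF_LEN  4096

static char bufs[NR_BUFS][BUF_LEN];

static void provide_and_use(struct io_uring *ring, int sockfd)
{
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;

        /* Hand NR_BUFS buffers to the kernel under group BGID */
        sqe = io_uring_get_sqe(ring);
        io_uring_prep_provide_buffers(sqe, bufs, BUF_LEN, NR_BUFS, BGID, 0);
        io_uring_submit(ring);
        io_uring_wait_cqe(ring, &cqe);
        io_uring_cqe_seen(ring, cqe);

        /* Let the kernel select a buffer from BGID for this recv */
        sqe = io_uring_get_sqe(ring);
        io_uring_prep_recv(sqe, sockfd, NULL, BUF_LEN, 0);
        sqe->flags |= IOSQE_BUFFER_SELECT;
        sqe->buf_group = BGID;
        io_uring_submit(ring);
        io_uring_wait_cqe(ring, &cqe);

        /* Which buffer was used is encoded in cqe->flags */
        if (cqe->flags & IORING_CQE_F_BUFFER) {
                int bid = cqe->flags >> IORING_CQE_BUFFER_SHIFT;
                /* received data is in bufs[bid], cqe->res bytes long */
                (void)bid;
        }
        io_uring_cqe_seen(ring, cqe);
}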
2022-03-10  io_uring: add support for registering ring file descriptors  Jens Axboe  3 files, -10/+190
Lots of workloads use multiple threads, in which case the file table is shared between them. This makes getting and putting the ring file descriptor for each io_uring_enter(2) system call more expensive, as it involves an atomic get and put for each call. Similarly to how we allow registering normal file descriptors to avoid this overhead, add support for an io_uring_register(2) API that allows registering the ring fds themselves:
1) IORING_REGISTER_RING_FDS - takes an array of io_uring_rsrc_update structs, and registers them with the task.
2) IORING_UNREGISTER_RING_FDS - takes an array of io_uring_rsrc_update structs, and unregisters them.
When a ring fd is registered, it is internally represented by an offset. This offset is returned to the application, and the application then uses this offset and sets IORING_ENTER_REGISTERED_RING for the io_uring_enter(2) system call. This works just like using a registered file descriptor, rather than a real one, in an SQE, where IOSQE_FIXED_FILE gets set to tell io_uring that we're using an internal offset/descriptor rather than a real file descriptor. In initial testing, this provides a nice bump in performance for threaded applications in real world cases where the batch count (eg number of requests submitted per io_uring_enter(2) invocation) is low. In a microbenchmark, submitting NOP requests, we see the following increases in performance:
Requests per syscall   Baseline     Registered   Increase
1                      ~7030K       ~8080K       +15%
2                      ~13120K      ~14800K      +13%
4                      ~22740K      ~25300K      +11%
Co-developed-by: Xiaoguang Wang <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
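A rough sketch of the userspace flow described above, using the raw register/enter syscalls (syscall numbers taken from sys/syscall.h; we assume here that offset = -1U asks the kernel to pick a slot and that the chosen slot is written back, per the description that the offset is returned to the application):

#include <liburing.h>          /* for the io_uring uapi definitions */
#include <sys/syscall.h>
#include <unistd.h>

static int enter_via_registered_fd(int ring_fd, unsigned to_submit)
{
        struct io_uring_rsrc_update reg = {
                .offset = -1U,          /* let the kernel choose a slot (assumed) */
                .data   = (__u64)ring_fd,
        };
        long ret;

        ret = syscall(__NR_io_uring_register, ring_fd,
                      IORING_REGISTER_RING_FDS, &reg, 1);
        if (ret < 0)
                return -1;

        /*
         * reg.offset now holds the registered index; use it in place of
         * the real fd, flagged with IORING_ENTER_REGISTERED_RING.
         */
        return (int)syscall(__NR_io_uring_enter, reg.offset, to_submit, 0,
                            IORING_ENTER_REGISTERED_RING, NULL, 0);
}

Liburing later grew helpers for this flow; the raw form above just mirrors the description in the commit message.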
2022-03-10  io_uring: documentation fixup  Dylan Yudaken  1 file, -1/+1
Fix incorrect name reference in comment. ki_filp does not exist in the struct, but file does. Signed-off-by: Dylan Yudaken <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-03-10  io_uring: do not recalculate ppos unnecessarily  Dylan Yudaken  1 file, -6/+12
There is a slight optimisation to be had by calculating the correct pos pointer inside io_kiocb_update_pos and then using that later. It seems code size drops by a bit:
000000000000a1b0 0000000000000400 t io_read
000000000000a5b0 0000000000000319 t io_write
vs
000000000000a1b0 00000000000003f6 t io_read
000000000000a5b0 0000000000000310 t io_write
Signed-off-by: Dylan Yudaken <[email protected]> Reviewed-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2022-03-10  io_uring: update kiocb->ki_pos at execution time  Dylan Yudaken  1 file, -8/+18
Update kiocb->ki_pos at execution time rather than in io_prep_rw(). io_prep_rw() happens before the job is enqueued to a worker and so the offset might be read multiple times before being executed once. This ensures that the file position in a set of _linked_ SQEs will only be obtained after earlier SQEs have completed, and so will include their incremented file position. Signed-off-by: Dylan Yudaken <[email protected]> Reviewed-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
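To make the linked-SQE point concrete, an illustrative pair of linked reads that both use the implicit file position (offset -1, read(2)-style); with the fix, the second read's starting offset is only computed after the first has completed (sketch only, file opened elsewhere):

#include <liburing.h>

static void queue_two_linked_reads(struct io_uring *ring, int fd,
                                   char *b1, char *b2, unsigned len)
{
        struct io_uring_sqe *sqe;

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_read(sqe, fd, b1, len, -1);   /* -1: use the file position */
        sqe->flags |= IOSQE_IO_LINK;                /* order before the next SQE */

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_read(sqe, fd, b2, len, -1);   /* continues where the first ended */

        io_uring_submit(ring);
}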
2022-03-10  io_uring: remove duplicated calls to io_kiocb_ppos  Dylan Yudaken  1 file, -2/+5
io_kiocb_ppos is called in both branches, and it seems that the compiler does not fuse this. Fusing removes a few bytes from loop_rw_iter.
Before:
$ nm -S fs/io_uring.o | grep loop_rw_iter
0000000000002430 0000000000000124 t loop_rw_iter
After:
$ nm -S fs/io_uring.o | grep loop_rw_iter
0000000000002430 000000000000010d t loop_rw_iter
Signed-off-by: Dylan Yudaken <[email protected]> Reviewed-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2022-03-10  io_uring: Remove unneeded test in io_run_task_work_sig()  Olivier Langlois  1 file, -3/+3
Avoid testing TIF_NOTIFY_SIGNAL twice by calling task_sigpending() directly from io_run_task_work_sig() Signed-off-by: Olivier Langlois <[email protected]> Link: https://lore.kernel.org/r/bd7c0495f7656e803e5736708591bb665e6eaacd.1645041650.git.olivier@trillion01.com Signed-off-by: Jens Axboe <[email protected]>
2022-03-10  io-uring: Make tracepoints consistent.  Stefan Roesch  2 files, -176/+166
This makes the io-uring tracepoints consistent. Where it makes sense the tracepoints start with the following four fields:
- context (ring)
- request
- user_data
- opcode
Signed-off-by: Stefan Roesch <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-03-10  io-uring: add __fill_cqe function  Stefan Roesch  1 file, -9/+13
This introduces the __fill_cqe function. This is necessary to correctly issue the io_uring_complete tracepoint. Signed-off-by: Stefan Roesch <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-03-10  io-wq: use IO_WQ_ACCT_NR rather than hardcoded number  Hao Xu  1 file, -2/+2
It's better to use the defined enum value rather than a hardcoded number to size the array. Signed-off-by: Hao Xu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-03-10  io-wq: reduce acct->lock crossing functions lock/unlock  Hao Xu  1 file, -20/+12
Reduce the pattern of locking acct->lock in one function and unlocking it in another, to make the code clearer. Signed-off-by: Hao Xu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-03-10  io-wq: decouple work_list protection from the big wqe->lock  Hao Xu  1 file, -44/+52
wqe->lock is abused; it now protects acct->work_list, hash stuff, nr_workers, wqe->free_list and so on. Let's first get the work_list out of the wqe->lock mess by introducing a specific lock for the work list. This is the first step to solving the huge contention between work insertion and work consumption.
Good things:
- split locking for the bound and unbound work lists
- reduced contention between work_list visits and the (worker's) free_list
For the hash stuff, since there won't be a work item with the same file in both the bound and unbound work lists, they won't visit the same hash entry, so it works well to use the new lock to protect the hash stuff as well.
Results: set max_unbound_worker = 4, test with echo-server:
nice -n -15 ./io_uring_echo_server -p 8081 -f -n 1000 -l 16 (-n connection, -l workload)
before this patch:
Samples: 2M of event 'cycles:ppp', Event count (approx.): 1239982111074
Overhead  Command          Shared Object     Symbol
28.59%    iou-wrk-10021    [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
 8.89%    io_uring_echo_s  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
 6.20%    iou-wrk-10021    [kernel.vmlinux]  [k] _raw_spin_lock
 2.45%    io_uring_echo_s  [kernel.vmlinux]  [k] io_prep_async_work
 2.36%    iou-wrk-10021    [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
 2.29%    iou-wrk-10021    [kernel.vmlinux]  [k] io_worker_handle_work
 1.29%    io_uring_echo_s  [kernel.vmlinux]  [k] io_wqe_enqueue
 1.06%    iou-wrk-10021    [kernel.vmlinux]  [k] io_wqe_worker
 1.06%    io_uring_echo_s  [kernel.vmlinux]  [k] _raw_spin_lock
 1.03%    iou-wrk-10021    [kernel.vmlinux]  [k] __schedule
 0.99%    iou-wrk-10021    [kernel.vmlinux]  [k] tcp_sendmsg_locked
with this patch:
Samples: 1M of event 'cycles:ppp', Event count (approx.): 708446691943
Overhead  Command          Shared Object     Symbol
16.86%    iou-wrk-10893    [kernel.vmlinux]  [k] native_queued_spin_lock_slowpat
 9.10%    iou-wrk-10893    [kernel.vmlinux]  [k] _raw_spin_lock
 4.53%    io_uring_echo_s  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpat
 2.87%    iou-wrk-10893    [kernel.vmlinux]  [k] io_worker_handle_work
 2.57%    iou-wrk-10893    [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
 2.56%    io_uring_echo_s  [kernel.vmlinux]  [k] io_prep_async_work
 1.82%    io_uring_echo_s  [kernel.vmlinux]  [k] _raw_spin_lock
 1.33%    iou-wrk-10893    [kernel.vmlinux]  [k] io_wqe_worker
 1.26%    io_uring_echo_s  [kernel.vmlinux]  [k] try_to_wake_up
spin_lock failure from 25.59% + 8.89% = 34.48% to 16.86% + 4.53% = 21.39%. TPS is similar, while CPU usage drops from almost 400% to 350%. Signed-off-by: Hao Xu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-03-10  io_uring: Fix use of uninitialized ret in io_eventfd_register()  Nathan Chancellor  1 file, -3/+3
Clang warns:
fs/io_uring.c:9396:9: warning: variable 'ret' is uninitialized when used here [-Wuninitialized]
        return ret;
               ^~~
fs/io_uring.c:9373:13: note: initialize the variable 'ret' to silence this warning
        int fd, ret;
                   ^
                    = 0
1 warning generated.
Just return 0 directly and reduce the scope of ret to the if statement, as that is the only place that it is used, which is how the function was before the fixes commit. Fixes: 1a75fac9a0f9 ("io_uring: avoid ring quiesce while registering/unregistering eventfd") Link: https://github.com/ClangBuiltLinux/linux/issues/1579 Signed-off-by: Nathan Chancellor <[email protected]> Reviewed-by: Nick Desaulniers <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-03-10  io_uring: remove ring quiesce for io_uring_register  Usama Arif  1 file, -83/+0
None of the opcodes in io_uring_register use ring quiesce anymore. Hence io_register_op_must_quiesce always returns false and io_ctx_quiesce is never called. Signed-off-by: Usama Arif <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-03-10  io_uring: avoid ring quiesce while registering restrictions and enabling rings  Usama Arif  1 file, -0/+2
IORING_SETUP_R_DISABLED prevents submitting requests and so there will be no requests until IORING_REGISTER_ENABLE_RINGS is called. And IORING_REGISTER_RESTRICTIONS works only before IORING_REGISTER_ENABLE_RINGS is called. Hence ring quiesce is not needed for these opcodes. Signed-off-by: Usama Arif <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
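To make the ordering concrete, a hedged liburing sketch (the helpers io_uring_register_restrictions() and io_uring_enable_rings() are assumed to be available in the liburing in use; the restriction itself is an arbitrary example): create the ring disabled, register restrictions while nothing can be submitted, then enable it.

#include <liburing.h>

static int setup_restricted_ring(struct io_uring *ring)
{
        struct io_uring_restriction res = {
                .opcode = IORING_RESTRICTION_SQE_OP,
                .sqe_op = IORING_OP_NOP,    /* example: only allow NOP SQEs */
        };
        int ret;

        /* Ring starts disabled: no submissions are possible yet */
        ret = io_uring_queue_init(8, ring, IORING_SETUP_R_DISABLED);
        if (ret < 0)
                return ret;

        /* Only valid before IORING_REGISTER_ENABLE_RINGS */
        ret = io_uring_register_restrictions(ring, &res, 1);
        if (ret < 0)
                return ret;

        /* From here on, submissions are allowed (and restricted) */
        return io_uring_enable_rings(ring);
}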
2022-03-10  io_uring: avoid ring quiesce while registering async eventfd  Usama Arif  1 file, -10/+12
This is done using the RCU data structure (io_ev_fd). eventfd_async is moved from io_ring_ctx to io_ev_fd, which is RCU protected, hence avoiding ring quiesce, which is much more expensive than an RCU lock. The place where eventfd_async is read is already under rcu_read_lock, so there is no extra RCU read-side critical section needed. Signed-off-by: Usama Arif <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-03-10  io_uring: avoid ring quiesce while registering/unregistering eventfd  Usama Arif  1 file, -21/+68
This is done by creating a new RCU data structure (io_ev_fd) as part of io_ring_ctx that holds the eventfd_ctx. The function io_eventfd_signal is executed under rcu_read_lock with a single rcu_dereference of io_ev_fd, so that if another thread unregisters the eventfd while io_eventfd_signal is still being executed, the eventfd_signal for which io_eventfd_signal was called completes successfully. The process of registering/unregistering eventfd is already done under uring_lock, so multiple threads won't enter a race condition while registering/unregistering eventfd. With the above approach, ring quiesce, which is much more expensive than using an RCU lock, can be avoided. On the system tested, io_uring_register with IORING_REGISTER_EVENTFD takes less than 1ms with RCU lock, compared to 15ms before with ring quiesce. Signed-off-by: Usama Arif <[email protected]> Link: https://lore.kernel.org/r/[email protected] [axboe: long line fixups] Signed-off-by: Jens Axboe <[email protected]>
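For reference, the userspace operations this speeds up are just the eventfd register/unregister calls; a small liburing sketch (standard liburing helpers):

#include <liburing.h>
#include <sys/eventfd.h>
#include <unistd.h>

static int use_registered_eventfd(struct io_uring *ring)
{
        int evfd, ret;

        evfd = eventfd(0, EFD_CLOEXEC);
        if (evfd < 0)
                return -1;

        ret = io_uring_register_eventfd(ring, evfd);   /* IORING_REGISTER_EVENTFD */
        if (ret < 0) {
                close(evfd);
                return ret;
        }

        /* ... submit work; read(evfd, ...) becomes readable as CQEs are posted ... */

        ret = io_uring_unregister_eventfd(ring);       /* IORING_UNREGISTER_EVENTFD */
        close(evfd);
        return ret;
}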
2022-03-10  io_uring: remove trace for eventfd  Usama Arif  2 files, -10/+6
The information on whether eventfd is registered is not very useful, and would result in the tracepoint being enclosed in an rcu_read_lock() section in a later patch that tries to avoid ring quiesce for registering eventfd. Suggested-by: Jens Axboe <[email protected]> Signed-off-by: Usama Arif <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-03-06  Linux 5.17-rc7  Linus Torvalds  1 file, -1/+1
2022-03-06  Merge tag 'for-5.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux  Linus Torvalds  13 files, -28/+230
Pull btrfs fixes from David Sterba: "A few more fixes for various problems that have user visible effects or seem to be urgent:
- fix corruption when combining DIO and non-blocking io_uring over multiple extents (seen on MariaDB)
- fix relocation crash due to premature return from commit
- fix quota deadlock between rescan and qgroup removal
- fix item data bounds checks in tree-checker (found on a fuzzed image)
- fix fsync of prealloc extents after EOF
- add missing run of delayed items after unlink during log replay
- don't start relocation until snapshot drop is finished
- fix reversed condition for subpage writers locking
- fix warning on page error"
* tag 'for-5.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fallback to blocking mode when doing async dio over multiple extents
  btrfs: add missing run of delayed items after unlink during log replay
  btrfs: qgroup: fix deadlock between rescan worker and remove qgroup
  btrfs: fix relocation crash due to premature return from btrfs_commit_transaction()
  btrfs: do not start relocation until in progress drops are done
  btrfs: tree-checker: use u64 for item data end to avoid overflow
  btrfs: do not WARN_ON() if we have PageError set
  btrfs: fix lost prealloc extents beyond eof after full fsync
  btrfs: subpage: fix a wrong check on subpage->writers
2022-03-06  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm  Linus Torvalds  4 files, -14/+20
Pull kvm fixes from Paolo Bonzini: "x86 guest:
- Tweaks to the paravirtualization code, to avoid using them when they're pointless or harmful
x86 host:
- Fix for SRCU lockdep splat
- Brown paper bag fix for the propagation of errno"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: pull kvm->srcu read-side to kvm_arch_vcpu_ioctl_run
  KVM: x86/mmu: Passing up the error state of mmu_alloc_shadow_roots()
  KVM: x86: Yield to IPI target vCPU only if it is busy
  x86/kvmclock: Fix Hyper-V Isolated VM's boot issue when vCPUs > 64
  x86/kvm: Don't waste memory if kvmclock is disabled
  x86/kvm: Don't use PV TLB/yield when mwait is advertised
2022-03-06  Merge tag 'powerpc-5.17-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux  Linus Torvalds  2 files, -2/+2
Pull powerpc fix from Michael Ellerman: "Fix build failure when CONFIG_PPC_64S_HASH_MMU is not set. Thanks to Murilo Opsfelder Araujo, and Erhard F"
* tag 'powerpc-5.17-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/64s: Fix build failure when CONFIG_PPC_64S_HASH_MMU is not set
2022-03-06  Merge tag 'trace-v5.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace  Linus Torvalds  3 files, -6/+6
Pull tracing fixes from Steven Rostedt:
- Fix sorting on old "cpu" value in histograms
- Fix return value of __setup() boot parameter handlers
* tag 'trace-v5.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix return value of __setup handlers
  tracing/histogram: Fix sorting on old "cpu" value
2022-03-05  Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input  Linus Torvalds  6 files, -61/+51
Pull input updates from Dmitry Torokhov:
- a fixup for the Goodix touchscreen driver allowing it to work on certain Cherry Trail devices
- a fix for an imbalanced enable/disable regulator in the Elan touchpad driver that became apparent when used with the Asus TF103C 2-in-1 dock
- a couple new input keycodes used on newer keyboards
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  HID: add mapping for KEY_ALL_APPLICATIONS
  HID: add mapping for KEY_DICTATE
  Input: elan_i2c - fix regulator enable count imbalance after suspend/resume
  Input: elan_i2c - move regulator_[en|dis]able() out of elan_[en|dis]able_power()
  Input: goodix - workaround Cherry Trail devices with a bogus ACPI Interrupt() resource
  Input: goodix - use the new soc_intel_is_byt() helper
  Input: samsung-keypad - properly state IOMEM dependency
2022-03-05  Merge branch 'akpm' (patches from Andrew)  Linus Torvalds  18 files, -137/+194
Merge misc fixes from Andrew Morton: "8 patches. Subsystems affected by this patch series: mm (hugetlb, pagemap, and userfaultfd), memfd, selftests, and kconfig"
* emailed patches from Andrew Morton <[email protected]>:
  configs/debug: set CONFIG_DEBUG_INFO=y properly
  proc: fix documentation and description of pagemap
  kselftest/vm: fix tests build with old libc
  memfd: fix F_SEAL_WRITE after shmem huge page allocated
  mm: fix use-after-free when anon vma name is used after vma is freed
  mm: prevent vm_area_struct::anon_name refcount saturation
  mm: refactor vm_area_struct::anon_vma_name usage code
  selftests/vm: cleanup hugetlb file after mremap test
2022-03-05  Merge tag 's390-5.17-5' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux  Linus Torvalds  6 files, -7/+62
Pull s390 fixes from Vasily Gorbik:
- Fix HAVE_DYNAMIC_FTRACE_WITH_ARGS implementation by providing correct switching between ftrace_caller/ftrace_regs_caller and supplying pt_regs only when ftrace_regs_caller is activated.
- Fix exception table sorting.
- Fix breakage of kdump tooling by preserving metadata it cannot function without.
* tag 's390-5.17-5' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/extable: fix exception table sorting
  s390/ftrace: fix arch_ftrace_get_regs implementation
  s390/ftrace: fix ftrace_caller/ftrace_regs_caller generation
  s390/setup: preserve memory at OLDMEM_BASE and OLDMEM_SIZE
2022-03-05  configs/debug: set CONFIG_DEBUG_INFO=y properly  Qian Cai  1 file, -1/+1
CONFIG_DEBUG_INFO can't be set by user directly, so set CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y instead. Otherwise, we end up with no debuginfo in vmlinux which is a big no-no for kernel debugging. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Qian Cai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-05  proc: fix documentation and description of pagemap  Yun Zhou  2 files, -2/+3
Since bit 57 was exported for uffd-wp write-protected pages (commit fb8e37f35a2f: "mm/pagemap: export uffd-wp protection information"), fixing the documentation to match can avoid unnecessary confusion. Link: https://lkml.kernel.org/r/[email protected] Fixes: fb8e37f35a2fe1 ("mm/pagemap: export uffd-wp protection information") Signed-off-by: Yun Zhou <[email protected]> Reviewed-by: Peter Xu <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Tiberiu A Georgescu <[email protected]> Cc: Florian Schmidt <[email protected]> Cc: Ivan Teterevkov <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Yang Shi <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Colin Cross <[email protected]> Cc: Alistair Popple <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>