aboutsummaryrefslogtreecommitdiff
path: root/io_uring/rsrc.c
AgeCommit message (Collapse)AuthorFilesLines
2024-10-16io_uring/rsrc: ignore dummy_ubuf for buffer cloningJens Axboe1-1/+2
For placeholder buffers, &dummy_ubuf is assigned which is a static value. When buffers are attempted cloned, don't attempt to grab a reference to it, as we both don't need it and it'll actively fail as dummy_ubuf doesn't have a valid reference count setup. Link: https://lore.kernel.org/io-uring/Zw8dkUzsxQ5LgAJL@ly-workstation/ Reported-by: Lai, Yi <yi1.lai@linux.intel.com> Fixes: 7cc2a6eadcd7 ("io_uring: add IORING_REGISTER_COPY_BUFFERS method") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-24Merge tag 'for-6.12/io_uring-20240922' of git://git.kernel.dk/linuxLinus Torvalds1-13/+10
Pull more io_uring updates from Jens Axboe: "Mostly just a set of fixes in here, or little changes that didn't get included in the initial pull request. This contains: - Move the SQPOLL napi polling outside the submission lock (Olivier) - Rename of the "copy buffers" API that got added in the 6.12 merge window. There's really no copying going on, it's just referencing the buffers. After a bit of consideration, decided that it was better to simply rename this to avoid potential confusion (me) - Shrink struct io_mapped_ubuf from 48 to 32 bytes, by changing it to start + len tracking rather than having start / end in there, and by removing the caching of folio_mask when we can just calculate it from folio_shift when we need it (me) - Fixes for the SQPOLL affinity checking (me, Felix) - Fix for how cqring waiting checks for the presence of task_work. Just check it directly rather than check for a specific notification mechanism (me) - Tweak to how request linking is represented in tracing (me) - Fix a syzbot report that deliberately sets up a huge list of overflow entries, and then hits rcu stalls when flushing this list. Just check for the need to preempt, and drop/reacquire locks in the loop. There's no state maintained over the loop itself, and each entry is yanked from head-of-list (me)" * tag 'for-6.12/io_uring-20240922' of git://git.kernel.dk/linux: io_uring: check if we need to reschedule during overflow flush io_uring: improve request linking trace io_uring: check for presence of task_work rather than TIF_NOTIFY_SIGNAL io_uring/sqpoll: do the napi busy poll outside the submission block io_uring: clean up a type in io_uring_register_get_file() io_uring/sqpoll: do not put cpumask on stack io_uring/sqpoll: retain test for whether the CPU is valid io_uring/rsrc: change ubuf->ubuf_end to length tracking io_uring/rsrc: get rid of io_mapped_ubuf->folio_mask io_uring: rename "copy buffers" to "clone buffers"
2024-09-16Merge tag 'for-6.12/io_uring-20240913' of git://git.kernel.dk/linuxLinus Torvalds1-42/+203
Pull io_uring updates from Jens Axboe: - NAPI fixes and cleanups (Pavel, Olivier) - Add support for absolute timeouts (Pavel) - Fixes for io-wq/sqpoll affinities (Felix) - Efficiency improvements for dealing with huge pages (Chenliang) - Support for a minwait mode, where the application essentially has two timouts - one smaller one that defines the batch timeout, and the overall large one similar to what we had before. This enables efficient use of batching based on count + timeout, while still working well with periods of less intensive workloads - Use ITER_UBUF for single segment sends - Add support for incremental buffer consumption. Right now each operation will always consume a full buffer. With incremental consumption, a recv/read operation only consumes the part of the buffer that it needs to satisfy the operation - Add support for GCOV for io_uring, to help retain a high coverage of test to code ratio - Fix regression with ocfs2, where an odd -EOPNOTSUPP wasn't correctly converted to a blocking retry - Add support for cloning registered buffers from one ring to another - Misc cleanups (Anuj, me) * tag 'for-6.12/io_uring-20240913' of git://git.kernel.dk/linux: (35 commits) io_uring: add IORING_REGISTER_COPY_BUFFERS method io_uring/register: provide helper to get io_ring_ctx from 'fd' io_uring/rsrc: add reference count to struct io_mapped_ubuf io_uring/rsrc: clear 'slot' entry upfront io_uring/io-wq: inherit cpuset of cgroup in io worker io_uring/io-wq: do not allow pinning outside of cpuset io_uring/rw: drop -EOPNOTSUPP check in __io_complete_rw_common() io_uring/rw: treat -EOPNOTSUPP for IOCB_NOWAIT like -EAGAIN io_uring/sqpoll: do not allow pinning outside of cpuset io_uring/eventfd: move refs to refcount_t io_uring: remove unused rsrc_put_fn io_uring: add new line after variable declaration io_uring: add GCOV_PROFILE_URING Kconfig option io_uring/kbuf: add support for incremental buffer consumption io_uring/kbuf: pass in 'len' argument for buffer commit Revert "io_uring: Require zeroed sqe->len on provided-buffers send" io_uring/kbuf: move io_ring_head_to_buf() to kbuf.h io_uring/kbuf: add io_kbuf_commit() helper io_uring/kbuf: shrink nr_iovs/mode in struct buf_sel_arg io_uring: wire up min batch wake timeout ...
2024-09-15io_uring/rsrc: change ubuf->ubuf_end to length trackingJens Axboe1-3/+3
If we change it to tracking ubuf->start + ubuf->len, then we can reduce the size of struct io_mapped_ubuf by another 4 bytes, effectively 8 bytes, as a hole is eliminated too. This shrinks io_mapped_ubuf to 32 bytes. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-15io_uring/rsrc: get rid of io_mapped_ubuf->folio_maskJens Axboe1-6/+3
We don't really need to cache this, let's reclaim 8 bytes from struct io_mapped_ubuf and just calculate it when we need it. The only hot path here is io_import_fixed(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-14io_uring: rename "copy buffers" to "clone buffers"Jens Axboe1-4/+4
A recent commit added support for copying registered buffers from one ring to another. But that term is a bit confusing, as no copying of buffer data is done here. What is being done is simply cloning the buffer registrations from one ring to another. Rename it while we still can, so that it's more descriptive. No functional changes in this patch. Fixes: 7cc2a6eadcd7 ("io_uring: add IORING_REGISTER_COPY_BUFFERS method") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-12io_uring: add IORING_REGISTER_COPY_BUFFERS methodJens Axboe1-0/+91
Buffers can get registered with io_uring, which allows to skip the repeated pin_pages, unpin/unref pages for each O_DIRECT operation. This reduces the overhead of O_DIRECT IO. However, registrering buffers can take some time. Normally this isn't an issue as it's done at initialization time (and hence less critical), but for cases where rings can be created and destroyed as part of an IO thread pool, registering the same buffers for multiple rings become a more time sensitive proposition. As an example, let's say an application has an IO memory pool of 500G. Initial registration takes: Got 500 huge pages (each 1024MB) Registered 500 pages in 409 msec or about 0.4 seconds. If we go higher to 900 1GB huge pages being registered: Registered 900 pages in 738 msec which is, as expected, a fully linear scaling. Rather than have each ring pin/map/register the same buffer pool, provide an io_uring_register(2) opcode to simply duplicate the buffers that are registered with another ring. Adding the same 900GB of registered buffers to the target ring can then be accomplished in: Copied 900 pages in 17 usec While timing differs a bit, this provides around a 25,000-40,000x speedup for this use case. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-11io_uring/rsrc: add reference count to struct io_mapped_ubufJens Axboe1-0/+3
Currently there's a single ring owner of a mapped buffer, and hence the reference count will always be 1 when it's torn down and freed. However, in preparation for being able to link io_mapped_ubuf to different spots, add a reference count to manage the lifetime of it. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-11io_uring/rsrc: clear 'slot' entry upfrontJens Axboe1-1/+1
No functional changes in this patch, but clearing the slot pointer earlier will be required by a later change. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-30io_uring/rsrc: ensure compat iovecs are copied correctlyJens Axboe1-4/+15
For buffer registration (or updates), a userspace iovec is copied in and updated. If the application is within a compat syscall, then the iovec type is compat_iovec rather than iovec. However, the type used in __io_sqe_buffers_update() and io_sqe_buffers_register() is always struct iovec, and hence the source is incremented by the size of a non-compat iovec in the loop. This misses every other iovec in the source, and will run into garbage half way through the copies and return -EFAULT to the application. Maintain the source address separately and assign to our user vec pointer, so that copies always happen from the right source address. While in there, correct a bad placement of __user which triggered the following sparse warning prior to this fix: io_uring/rsrc.c:981:33: warning: cast removes address space '__user' of expression io_uring/rsrc.c:981:30: warning: incorrect type in assignment (different address spaces) io_uring/rsrc.c:981:30: expected struct iovec const [noderef] __user *uvec io_uring/rsrc.c:981:30: got struct iovec *[noderef] __user Fixes: f4eaf8eda89e ("io_uring/rsrc: Drop io_copy_iov in favor of iovec API") Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25io_uring/rsrc: enable multi-hugepage buffer coalescingChenliang Li1-32/+102
Add support for checking and coalescing multi-hugepage-backed fixed buffers. The coalescing optimizes both time and space consumption caused by mapping and storing multi-hugepage fixed buffers. A coalescable multi-hugepage buffer should fully cover its folios (except potentially the first and last one), and these folios should have the same size. These requirements are for easier processing later, also we need same size'd chunks in io_import_fixed for fast iov_iter adjust. Signed-off-by: Chenliang Li <cliang01.li@samsung.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20240731090133.4106-3-cliang01.li@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25io_uring/rsrc: store folio shift and mask into imuChenliang Li1-9/+6
Store the folio shift and folio mask into imu struct and use it in iov_iter adjust, as we will have non PAGE_SIZE'd chunks if a multi-hugepage buffer get coalesced. Signed-off-by: Chenliang Li <cliang01.li@samsung.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20240731090133.4106-2-cliang01.li@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-15Merge tag 'for-6.11/io_uring-20240714' of git://git.kernel.dk/linuxLinus Torvalds1-41/+22
Pull io_uring updates from Jens Axboe: "Here are the io_uring updates queued up for 6.11. Nothing major this time around, various minor improvements and cleanups/fixes. This contains: - Add bind/listen opcodes. Main motivation is to support direct descriptors, to avoid needing a regular fd just for doing these two operations (Gabriel) - Probe fixes (Gabriel) - Treat io-wq work flags as atomics. Not fixing a real issue, but may as well and it silences a KCSAN warning (me) - Cleanup of rsrc __set_current_state() usage (me) - Add 64-bit for {m,f}advise operations (me) - Improve performance of data ring messages (me) - Fix for ring message overflow posting (Pavel) - Fix for freezer interaction with TWA_NOTIFY_SIGNAL. Not strictly an io_uring thing, but since TWA_NOTIFY_SIGNAL was originally added for faster task_work signaling for io_uring, bundling it with this pull (Pavel) - Add Pavel as a co-maintainer - Various cleanups (me, Thorsten)" * tag 'for-6.11/io_uring-20240714' of git://git.kernel.dk/linux: (28 commits) io_uring/net: check socket is valid in io_bind()/io_listen() kernel: rerun task_work while freezing in get_signal() io_uring/io-wq: limit retrying worker initialisation io_uring/napi: Remove unnecessary s64 cast io_uring/net: cleanup io_recv_finish() bundle handling io_uring/msg_ring: fix overflow posting MAINTAINERS: change Pavel Begunkov from io_uring reviewer to maintainer io_uring/msg_ring: use kmem_cache_free() to free request io_uring/msg_ring: check for dead submitter task io_uring/msg_ring: add an alloc cache for io_kiocb entries io_uring/msg_ring: improve handling of target CQE posting io_uring: add io_add_aux_cqe() helper io_uring: add remote task_work execution helper io_uring/msg_ring: tighten requirement for remote posting io_uring: Allocate only necessary memory in io_probe io_uring: Fix probe of disabled operations io_uring: Introduce IORING_OP_LISTEN io_uring: Introduce IORING_OP_BIND net: Split a __sys_listen helper for io_uring net: Split a __sys_bind helper for io_uring ...
2024-06-20io_uring/rsrc: fix incorrect assignment of iter->nr_segs in io_import_fixedChenliang Li1-1/+0
In io_import_fixed when advancing the iter within the first bvec, the iter->nr_segs is set to bvec->bv_len. nr_segs should be the number of bvecs, plus we don't need to adjust it here, so just remove it. Fixes: b000ae0ec2d7 ("io_uring/rsrc: optimise single entry advance") Signed-off-by: Chenliang Li <cliang01.li@samsung.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20240619063819.2445-1-cliang01.li@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-16io_uring/rsrc: remove redundant __set_current_state() post schedule()Jens Axboe1-2/+1
We're guaranteed to be in a TASK_RUNNING state post schedule, so we never need to set the state after that. While in there, remove the other __set_current_state() as well, and just call finish_wait() when we now we're going to break anyway. This is easier to grok than manual __set_current_state() calls. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-16io_uring/rsrc: Drop io_copy_iov in favor of iovec APIGabriel Krisman Bertazi1-39/+21
Instead of open coding an io_uring function to copy iovs from userspace, rely on the existing iovec_from_user function. While there, avoid repeatedly zeroing the iov in the !arg case for io_sqe_buffer_register. tested with liburing testsuite. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20240523214535.31890-1-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-12io_uring/rsrc: don't lock while !TASK_RUNNINGPavel Begunkov1-0/+1
There is a report of io_rsrc_ref_quiesce() locking a mutex while not TASK_RUNNING, which is due to forgetting restoring the state back after io_run_task_work_sig() and attempts to break out of the waiting loop. do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff815d2494>] prepare_to_wait+0xa4/0x380 kernel/sched/wait.c:237 WARNING: CPU: 2 PID: 397056 at kernel/sched/core.c:10099 __might_sleep+0x114/0x160 kernel/sched/core.c:10099 RIP: 0010:__might_sleep+0x114/0x160 kernel/sched/core.c:10099 Call Trace: <TASK> __mutex_lock_common kernel/locking/mutex.c:585 [inline] __mutex_lock+0xb4/0x940 kernel/locking/mutex.c:752 io_rsrc_ref_quiesce+0x590/0x940 io_uring/rsrc.c:253 io_sqe_buffers_unregister+0xa2/0x340 io_uring/rsrc.c:799 __io_uring_register io_uring/register.c:424 [inline] __do_sys_io_uring_register+0x5b9/0x2400 io_uring/register.c:613 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xd8/0x270 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x6f/0x77 Reported-by: Li Shi <sl1589472800@gmail.com> Fixes: 4ea15b56f0810 ("io_uring/rsrc: use wq for quiescing") Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/77966bc104e25b0534995d5dbb152332bc8f31c0.1718196953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring: move mapping/allocation helpers to a separate fileJens Axboe1-0/+1
Move the related code from io_uring.c into memmap.c. No functional changes in this patch, just cleaning it up a bit now that the full transition is done. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring: unify io_pin_pages()Jens Axboe1-36/+0
Move it into io_uring.c where it belongs, and use it in there as well rather than have two implementations of this. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring/alloc_cache: switch to array based cachingJens Axboe1-6/+4
Currently lists are being used to manage this, but best practice is usually to have these in an array instead as that it cheaper to manage. Outside of that detail, games are also played with KASAN as the list is inside the cached entry itself. Finally, all users of this need a struct io_cache_entry embedded in their struct, which is union'ized with something else in there that isn't used across the free -> realloc cycle. Get rid of all of that, and simply have it be an array. This will not change the memory used, as we're just trading an 8-byte member entry for the per-elem array size. This reduces the overhead of the recycled allocations, and it reduces the amount of code code needed to support recycling to about half of what it currently is. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-19io_uring: drop any code related to SCM_RIGHTSJens Axboe1-165/+4
This is dead code after we dropped support for passing io_uring fds over SCM_RIGHTS, get rid of it. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-20io_uring: fix off-by one bvec indexKeith Busch1-1/+1
If the offset equals the bv_len of the first registered bvec, then the request does not include any of that first bvec. Skip it so that drivers don't have to deal with a zero length bvec, which was observed to break NVMe's PRP list creation. Cc: stable@vger.kernel.org Fixes: bd11b3a391e3 ("io_uring: don't use iov_iter_advance() for fixed buffers") Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20231120221831.2646460-1-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-02io_uring/rsrc: cleanup io_pin_pages()Jens Axboe1-20/+17
This function is overly convoluted with a goto error path, and checks under the mmap_read_lock() that don't need to be at all. Rearrange it a bit so the checks and errors fall out naturally, rather than needing to jump around for it. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11io_uring/rsrc: keep one global dummy_ubufPavel Begunkov1-4/+10
We set empty registered buffers to dummy_ubuf as an optimisation. Currently, we allocate the dummy entry for each ring, whenever we can simply have one global instance. We're casting out const on assignment, it's fine as we're not going to change the content of the dummy, the constness gives us an extra layer of protection if sth ever goes wrong. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e4a96dda35ab755914bc43f6781bba0df97ac489.1691757663.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-28Merge tag 'mm-stable-2023-06-24-19-15' of ↵Linus Torvalds1-28/+6
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: - Yosry Ahmed brought back some cgroup v1 stats in OOM logs - Yosry has also eliminated cgroup's atomic rstat flushing - Nhat Pham adds the new cachestat() syscall. It provides userspace with the ability to query pagecache status - a similar concept to mincore() but more powerful and with improved usability - Mel Gorman provides more optimizations for compaction, reducing the prevalence of page rescanning - Lorenzo Stoakes has done some maintanance work on the get_user_pages() interface - Liam Howlett continues with cleanups and maintenance work to the maple tree code. Peng Zhang also does some work on maple tree - Johannes Weiner has done some cleanup work on the compaction code - David Hildenbrand has contributed additional selftests for get_user_pages() - Thomas Gleixner has contributed some maintenance and optimization work for the vmalloc code - Baolin Wang has provided some compaction cleanups, - SeongJae Park continues maintenance work on the DAMON code - Huang Ying has done some maintenance on the swap code's usage of device refcounting - Christoph Hellwig has some cleanups for the filemap/directio code - Ryan Roberts provides two patch series which yield some rationalization of the kernel's access to pte entries - use the provided APIs rather than open-coding accesses - Lorenzo Stoakes has some fixes to the interaction between pagecache and directio access to file mappings - John Hubbard has a series of fixes to the MM selftesting code - ZhangPeng continues the folio conversion campaign - Hugh Dickins has been working on the pagetable handling code, mainly with a view to reducing the load on the mmap_lock - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from 128 to 8 - Domenico Cerasuolo has improved the zswap reclaim mechanism by reorganizing the LRU management - Matthew Wilcox provides some fixups to make gfs2 work better with the buffer_head code - Vishal Moola also has done some folio conversion work - Matthew Wilcox has removed the remnants of the pagevec code - their functionality is migrated over to struct folio_batch * tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits) mm/hugetlb: remove hugetlb_set_page_subpool() mm: nommu: correct the range of mmap_sem_read_lock in task_mem() hugetlb: revert use of page_cache_next_miss() Revert "page cache: fix page_cache_next/prev_miss off by one" mm/vmscan: fix root proactive reclaim unthrottling unbalanced node mm: memcg: rename and document global_reclaim() mm: kill [add|del]_page_to_lru_list() mm: compaction: convert to use a folio in isolate_migratepages_block() mm: zswap: fix double invalidate with exclusive loads mm: remove unnecessary pagevec includes mm: remove references to pagevec mm: rename invalidate_mapping_pagevec to mapping_try_invalidate mm: remove struct pagevec net: convert sunrpc from pagevec to folio_batch i915: convert i915_gpu_error to use a folio_batch pagevec: rename fbatch_count() mm: remove check_move_unevictable_pages() drm: convert drm_gem_put_pages() to use a folio_batch i915: convert shmem_sg_free_table() to use a folio_batch scatterlist: add sg_set_folio() ...
2023-06-20io_uring: add helpers to decode the fixed file file_ptrChristoph Hellwig1-4/+4
Remove all the open coded magic on slot->file_ptr by introducing two helpers that return the file pointer and the flags instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230620113235.920399-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-09mm/gup: remove vmas parameter from pin_user_pages()Lorenzo Stoakes1-1/+1
We are now in a position where no caller of pin_user_pages() requires the vmas parameter at all, so eliminate this parameter from the function and all callers. This clears the way to removing the vmas parameter from GUP altogether. Link: https://lkml.kernel.org/r/195a99ae949c9f5cb589d2222b736ced96ec199a.1684350871.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> [qib] Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Sakari Ailus <sakari.ailus@linux.intel.com> [drivers/media] Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian König <christian.koenig@amd.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09io_uring: rsrc: delegate VMA file-backed check to GUPLorenzo Stoakes1-28/+6
Now that the GUP explicitly checks FOLL_LONGTERM pin_user_pages() for broken file-backed mappings in "mm/gup: disallow FOLL_LONGTERM GUP-nonfast writing to file-backed mappings", there is no need to explicitly check VMAs for this condition, so simply remove this logic from io_uring altogether. Link: https://lkml.kernel.org/r/e4a4efbda9cd12df71e0ed81796dc630231a1ef2.1684350871.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian König <christian.koenig@amd.com> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-07Merge tag 'for-6.4/io_uring-2023-05-07' of git://git.kernel.dk/linuxLinus Torvalds1-1/+6
Pull more io_uring updates from Jens Axboe: "Nothing major in here, just two different parts: - A small series from Breno that enables passing the full SQE down for ->uring_cmd(). This is a prerequisite for enabling full network socket operations. Queued up a bit late because of some stylistic concerns that got resolved, would be nice to have this in 6.4-rc1 so the dependent work will be easier to handle for 6.5. - Fix for the huge page coalescing, which was a regression introduced in the 6.3 kernel release (Tobias)" * tag 'for-6.4/io_uring-2023-05-07' of git://git.kernel.dk/linux: io_uring: Remove unnecessary BUILD_BUG_ON io_uring: Pass whole sqe to commands io_uring: Create a helper to return the SQE size io_uring/rsrc: check for nonconsecutive pages
2023-05-03io_uring/rsrc: check for nonconsecutive pagesTobias Holl1-1/+6
Pages that are from the same folio do not necessarily need to be consecutive. In that case, we cannot consolidate them into a single bvec entry. Before applying the huge page optimization from commit 57bebf807e2a ("io_uring/rsrc: optimise registered huge pages"), check that the memory is actually consecutive. Cc: stable@vger.kernel.org Fixes: 57bebf807e2a ("io_uring/rsrc: optimise registered huge pages") Signed-off-by: Tobias Holl <tobias@tholl.xyz> [axboe: formatting] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-26Merge tag 'net-next-6.4' of ↵Linus Torvalds1-2/+1
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "Core: - Introduce a config option to tweak MAX_SKB_FRAGS. Increasing the default value allows for better BIG TCP performances - Reduce compound page head access for zero-copy data transfers - RPS/RFS improvements, avoiding unneeded NET_RX_SOFTIRQ when possible - Threaded NAPI improvements, adding defer skb free support and unneeded softirq avoidance - Address dst_entry reference count scalability issues, via false sharing avoidance and optimize refcount tracking - Add lockless accesses annotation to sk_err[_soft] - Optimize again the skb struct layout - Extends the skb drop reasons to make it usable by multiple subsystems - Better const qualifier awareness for socket casts BPF: - Add skb and XDP typed dynptrs which allow BPF programs for more ergonomic and less brittle iteration through data and variable-sized accesses - Add a new BPF netfilter program type and minimal support to hook BPF programs to netfilter hooks such as prerouting or forward - Add more precise memory usage reporting for all BPF map types - Adds support for using {FOU,GUE} encap with an ipip device operating in collect_md mode and add a set of BPF kfuncs for controlling encap params - Allow BPF programs to detect at load time whether a particular kfunc exists or not, and also add support for this in light skeleton - Bigger batch of BPF verifier improvements to prepare for upcoming BPF open-coded iterators allowing for less restrictive looping capabilities - Rework RCU enforcement in the verifier, add kptr_rcu and enforce BPF programs to NULL-check before passing such pointers into kfunc - Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and in local storage maps - Enable RCU semantics for task BPF kptrs and allow referenced kptr tasks to be stored in BPF maps - Add support for refcounted local kptrs to the verifier for allowing shared ownership, useful for adding a node to both the BPF list and rbtree - Add BPF verifier support for ST instructions in convert_ctx_access() which will help new -mcpu=v4 clang flag to start emitting them - Add ARM32 USDT support to libbpf - Improve bpftool's visual program dump which produces the control flow graph in a DOT format by adding C source inline annotations Protocols: - IPv4: Allow adding to IPv4 address a 'protocol' tag. Such value indicates the provenance of the IP address - IPv6: optimize route lookup, dropping unneeded R/W lock acquisition - Add the handshake upcall mechanism, allowing the user-space to implement generic TLS handshake on kernel's behalf - Bridge: support per-{Port, VLAN} neighbor suppression, increasing resilience to nodes failures - SCTP: add support for Fair Capacity and Weighted Fair Queueing schedulers - MPTCP: delay first subflow allocation up to its first usage. This will allow for later better LSM interaction - xfrm: Remove inner/outer modes from input/output path. These are not needed anymore - WiFi: - reduced neighbor report (RNR) handling for AP mode - HW timestamping support - support for randomized auth/deauth TA for PASN privacy - per-link debugfs for multi-link - TC offload support for mac80211 drivers - mac80211 mesh fast-xmit and fast-rx support - enable Wi-Fi 7 (EHT) mesh support Netfilter: - Add nf_tables 'brouting' support, to force a packet to be routed instead of being bridged - Update bridge netfilter and ovs conntrack helpers to handle IPv6 Jumbo packets properly, i.e. fetch the packet length from hop-by-hop extension header. This is needed for BIT TCP support - The iptables 32bit compat interface isn't compiled in by default anymore - Move ip(6)tables builtin icmp matches to the udptcp one. This has the advantage that icmp/icmpv6 match doesn't load the iptables/ip6tables modules anymore when iptables-nft is used - Extended netlink error report for netdevice in flowtables and netdev/chains. Allow for incrementally add/delete devices to netdev basechain. Allow to create netdev chain without device Driver API: - Remove redundant Device Control Error Reporting Enable, as PCI core has already error reporting enabled at enumeration time - Move Multicast DB netlink handlers to core, allowing devices other then bridge to use them - Allow the page_pool to directly recycle the pages from safely localized NAPI - Implement lockless TX queue stop/wake combo macros, allowing for further code de-duplication and sanitization - Add YNL support for user headers and struct attrs - Add partial YNL specification for devlink - Add partial YNL specification for ethtool - Add tc-mqprio and tc-taprio support for preemptible traffic classes - Add tx push buf len param to ethtool, specifies the maximum number of bytes of a transmitted packet a driver can push directly to the underlying device - Add basic LED support for switch/phy - Add NAPI documentation, stop relaying on external links - Convert dsa_master_ioctl() to netdev notifier. This is a preparatory work to make the hardware timestamping layer selectable by user space - Add transceiver support and improve the error messages for CAN-FD controllers New hardware / drivers: - Ethernet: - AMD/Pensando core device support - MediaTek MT7981 SoC - MediaTek MT7988 SoC - Broadcom BCM53134 embedded switch - Texas Instruments CPSW9G ethernet switch - Qualcomm EMAC3 DWMAC ethernet - StarFive JH7110 SoC - NXP CBTX ethernet PHY - WiFi: - Apple M1 Pro/Max devices - RealTek rtl8710bu/rtl8188gu - RealTek rtl8822bs, rtl8822cs and rtl8821cs SDIO chipset - Bluetooth: - Realtek RTL8821CS, RTL8851B, RTL8852BS - Mediatek MT7663, MT7922 - NXP w8997 - Actions Semi ATS2851 - QTI WCN6855 - Marvell 88W8997 - Can: - STMicroelectronics bxcan stm32f429 Drivers: - Ethernet NICs: - Intel (1G, icg): - add tracking and reporting of QBV config errors - add support for configuring max SDU for each Tx queue - Intel (100G, ice): - refactor mailbox overflow detection to support Scalable IOV - GNSS interface optimization - Intel (i40e): - support XDP multi-buffer - nVidia/Mellanox: - add the support for linux bridge multicast offload - enable TC offload for egress and engress MACVLAN over bond - add support for VxLAN GBP encap/decap flows offload - extend packet offload to fully support libreswan - support tunnel mode in mlx5 IPsec packet offload - extend XDP multi-buffer support - support MACsec VLAN offload - add support for dynamic msix vectors allocation - drop RX page_cache and fully use page_pool - implement thermal zone to report NIC temperature - Netronome/Corigine: - add support for multi-zone conntrack offload - Solarflare/Xilinx: - support offloading TC VLAN push/pop actions to the MAE - support TC decap rules - support unicast PTP - Other NICs: - Broadcom (bnxt): enforce software based freq adjustments only on shared PHC NIC - RealTek (r8169): refactor to addess ASPM issues during NAPI poll - Micrel (lan8841): add support for PTP_PF_PEROUT - Cadence (macb): enable PTP unicast - Engleder (tsnep): add XDP socket zero-copy support - virtio-net: implement exact header length guest feature - veth: add page_pool support for page recycling - vxlan: add MDB data path support - gve: add XDP support for GQI-QPL format - geneve: accept every ethertype - macvlan: allow some packets to bypass broadcast queue - mana: add support for jumbo frame - Ethernet high-speed switches: - Microchip (sparx5): Add support for TC flower templates - Ethernet embedded switches: - Broadcom (b54): - configure 6318 and 63268 RGMII ports - Marvell (mv88e6xxx): - faster C45 bus scan - Microchip: - lan966x: - add support for IS1 VCAP - better TX/RX from/to CPU performances - ksz9477: add ETS Qdisc support - ksz8: enhance static MAC table operations and error handling - sama7g5: add PTP capability - NXP (ocelot): - add support for external ports - add support for preemptible traffic classes - Texas Instruments: - add CPSWxG SGMII support for J7200 and J721E - Intel WiFi (iwlwifi): - preparation for Wi-Fi 7 EHT and multi-link support - EHT (Wi-Fi 7) sniffer support - hardware timestamping support for some devices/firwmares - TX beacon protection on newer hardware - Qualcomm 802.11ax WiFi (ath11k): - MU-MIMO parameters support - ack signal support for management packets - RealTek WiFi (rtw88): - SDIO bus support - better support for some SDIO devices (e.g. MAC address from efuse) - RealTek WiFi (rtw89): - HW scan support for 8852b - better support for 6 GHz scanning - support for various newer firmware APIs - framework firmware backwards compatibility - MediaTek WiFi (mt76): - P2P support - mesh A-MSDU support - EHT (Wi-Fi 7) support - coredump support" * tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2078 commits) net: phy: hide the PHYLIB_LEDS knob net: phy: marvell-88x2222: remove unnecessary (void*) conversions tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp. net: amd: Fix link leak when verifying config failed net: phy: marvell: Fix inconsistent indenting in led_blink_set lan966x: Don't use xdp_frame when action is XDP_TX tsnep: Add XDP socket zero-copy TX support tsnep: Add XDP socket zero-copy RX support tsnep: Move skb receive action to separate function tsnep: Add functions for queue enable/disable tsnep: Rework TX/RX queue initialization tsnep: Replace modulo operation with mask net: phy: dp83867: Add led_brightness_set support net: phy: Fix reading LED reg property drivers: nfc: nfcsim: remove return value check of `dev_dir` net: phy: dp83867: Remove unnecessary (void*) conversions net: ethtool: coalesce: try to make user settings stick twice net: mana: Check if netdev/napi_alloc_frag returns single page net: mana: Rename mana_refill_rxoob and remove some empty lines net: veth: add page_pool stats ...
2023-04-20Revert "io_uring/rsrc: disallow multi-source reg buffers"Jens Axboe1-8/+5
This reverts commit edd478269640b360c6f301f2baa04abdda563ef3. There's really no specific need to disallow multiple sources of buffers, and io_uring really should not be mandating this by itself. We should be able to solely rely on GUP making these decisions. As this also stands in the way of a cleanup where io_uring is the odd one out, kill it. Link: https://lore.kernel.org/all/61ded378-51a8-1dcb-b631-fda1903248a9@gmail.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-18io_uring/rsrc: disassociate nodes and rsrc_dataPavel Begunkov1-11/+9
Make rsrc nodes independent from rsrd_data, for that we keep ctx and rsrc type in nodes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4f259abe9cd4eea6a3b4ed83508635218acd3c3f.1681822823.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-18io_uring/rsrc: devirtualise rsrc put callbacksPavel Begunkov1-6/+19
We only have two rsrc types, buffers and files, replace virtual callbacks for putting resources down with a switch..case. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/02ca727bf8e5f7f820c2f404e95ae88c8f472930.1681822823.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-18io_uring/rsrc: pass node to io_rsrc_put_work()Pavel Begunkov1-6/+6
Instead of passing rsrc_data and a resource to io_rsrc_put_work() just forward node, that's all the function needs. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/791e8edd28d78797240b74d34e99facbaad62f3b.1681822823.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-18io_uring/rsrc: inline io_rsrc_put_work()Pavel Begunkov1-13/+6
io_rsrc_put_work() is simple enough to be open coded into its only caller. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1b36dd46766ced39a9b160767babfa2fce07b8f8.1681822823.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-18io_uring/rsrc: add empty flag in rsrc_nodePavel Begunkov1-3/+3
Unless a node was flushed by io_rsrc_ref_quiesce(), it'll carry a resource. Replace ->inline_items with an empty flag, which is initialised to false and only raised in io_rsrc_ref_quiesce(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/75d384c9d2252e12af73b9cf8a44e1699106aeb1.1681822823.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-18io_uring/rsrc: merge nodes and io_rsrc_putPavel Begunkov1-68/+23
struct io_rsrc_node carries a number of resources represented by struct io_rsrc_put. That was handy before for sync overhead ammortisation, but all complexity is gone and nodes are simple and lightweight. Let's allocate a separate node for each resource. Nodes and io_rsrc_put and not much different in size, and former are cached, so node allocation should work better. That also removes some overhead for nested iteration in io_rsrc_node_ref_zero() / __io_rsrc_put_work(). Another reason for the patch is that it greatly reduces complexity by moving io_rsrc_node_switch[_start]() inside io_queue_rsrc_removal(), so users don't have to care about it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c7d3a45b30cc14cd93700a710dd112edc703db98.1681822823.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-18io_uring/rsrc: infer node from ctx on io_queue_rsrc_removalPavel Begunkov1-4/+5
For io_queue_rsrc_removal() we should always use the current active rsrc node, don't pass it directly but let the function grab it from the context. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d15939b4afea730978b4925685c2577538b823bb.1681822823.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-15io_uring/rsrc: refactor io_queue_rsrc_removalPavel Begunkov1-4/+1
We can queue up a rsrc into a list in io_queue_rsrc_removal() while allocating io_rsrc_put and so simplify the function. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/36bd708ee25c0e2e7992dc19b17db166eea9ac40.1681395792.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-15io_uring/rsrc: clean up __io_sqe_buffers_update()Pavel Begunkov1-2/+1
Inline offset variable, so we don't use it without subjecting it to array_index_nospec() first. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/77936d9ed23755588810c5eafcea7e1c3b90e3cd.1681395792.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-15io_uring/rsrc: inline switch_start fast pathPavel Begunkov1-7/+5
Inline the part of io_rsrc_node_switch_start() that checks whether the cache is empty or not, as most of the times it will have some number of entries in there. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9619c1717a0e01f22c5fce2f1ba2735f804da0f2.1681395792.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-15io_uring/rsrc: remove rsrc_data refsPavel Begunkov1-24/+8
Instead of waiting for rsrc_data->refs to be downed to zero, check whether there are rsrc nodes queued for completion, that's easier then maintaining references. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8e33fd143d83e11af3e386aea28eb6d6c6a1be10.1681395792.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-15io_uring/rsrc: fix DEFER_TASKRUN rsrc quiescePavel Begunkov1-0/+9
For io_rsrc_ref_quiesce() to progress it should execute all task_work items, including deferred ones. However, currently nobody would wake us, and so let's set ctx->cq_wait_nr, so io_req_local_work_add() would wake us up. Fixes: c0e0d6ba25f18 ("io_uring: add IORING_SETUP_DEFER_TASKRUN") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f1a90d1bc5ebf096475b018fed52e54f3b89d4af.1681395792.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-15io_uring/rsrc: use wq for quiescingPavel Begunkov1-6/+12
Replace completions with waitqueues for rsrc data quiesce, the main wakeup condition is when data refs hit zero. Note that data refs are only changes under ->uring_lock, so we prepare before mutex_unlock() reacquire it after taking the lock back. This change will be needed in the next patch. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1d0dbc74b3b4fd67c8f01819e680c5e0da252956.1681395792.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-15io_uring/rsrc: refactor io_rsrc_ref_quiescePavel Begunkov1-13/+5
Refactor io_rsrc_ref_quiesce() by moving the first mutex_unlock(), so we don't have to have a second mutex_unlock() further in the loop. It prepares us to the next patch. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/65bc876271fb16bf550a53a4c76c91aacd94e52e.1681395792.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-15io_uring/rsrc: remove io_rsrc_node::donePavel Begunkov1-4/+1
Kill io_rsrc_node::node and check refs instead, it's set when the nodes refcount hits zero, and it won't change afterwards. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/bbde361f4010f7e8bf196f1ecca27a763b79926f.1681395792.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-15io_uring/rsrc: use nospec'ed indexesPavel Begunkov1-1/+1
We use array_index_nospec() for registered buffer indexes, but don't use it while poking into rsrc tags, fix that. Fixes: 634d00df5e1cf ("io_uring: add full-fledged dynamic buffers support") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f02fafc5a9c0dd69be2b0618c38831c078232ff0.1681395792.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-12io_uring/rsrc: extract SCM file put helperPavel Begunkov1-9/+11
SCM file accounting is a slow path and is only used for UNIX files. Extract a helper out of io_rsrc_file_put() that does the SCM unaccounting. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/58cc7bffc2ee96bec8c2b89274a51febcbfa5556.1681210788.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-12io_uring/rsrc: refactor io_rsrc_node_switchPavel Begunkov1-25/+11
We use io_rsrc_node_switch() coupled with io_rsrc_node_switch_start() for a bunch of cases including initialising ctx->rsrc_node, i.e. by passing NULL instead of rsrc_data. Leave it to only deal with actual node changing. For that, first remove it from io_uring_create() and add a function allocating the first node. Then also remove all calls to io_rsrc_node_switch() from files/buffers register as we already have a node installed and it does essentially nothing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d146fe306ff98b1a5a60c997c252534f03d423d7.1681210788.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>