path: root/include/linux
2024-04-17  fsnotify: fix UAF from FS_ERROR event on a shutting down filesystem  (Amir Goldstein, 1 file, +4/-0)
Protect against use after free when a filesystem calls fsnotify_sb_error() during fs shutdown. Move freeing of sb->s_fsnotify_info to destroy_super_work(), because it may be accessed from fs shutdown context. Reported-by: [email protected] Suggested-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/linux-fsdevel/20240416173211.4lnmgctyo4jn5fha@quack3/ Fixes: 07a3b8d0bf72 ("fsnotify: lazy attach fsnotify_sb_info state to sb") Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Amir Goldstein <[email protected]> Signed-off-by: Jan Kara <[email protected]> Message-Id: <[email protected]>
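A minimal sketch of the shape of the fix (the helper name and placement are assumptions inferred from the commit message; +4 lines in include/linux):

    /* include/linux/fsnotify_backend.h: free sb->s_fsnotify_info late */
    static inline void fsnotify_sb_free(struct super_block *sb)
    {
            kfree(sb->s_fsnotify_info);   /* assumed field/helper naming */
    }

    /* fs/super.c: called from destroy_super_work() rather than at
     * unmount time, so fsnotify_sb_error() from fs-shutdown context can
     * no longer dereference freed state */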
2024-04-17  Merge patch series 'Fix shmem_rename2 directory offset calculation' of https://lore.kernel.org/r/[email protected]  (Christian Brauner, 1 file, +2/-0)
Pull shmem_rename2() offset fixes from Chuck Lever: The existing code in shmem_rename2() allocates a fresh directory offset value when renaming over an existing destination entry. User space does not expect this behavior. In particular, applications that rename while walking a directory can loop indefinitely because they never reach the end of the directory. * 'Fix shmem_rename2 directory offset calculation' of https://lore.kernel.org/r/[email protected]: (3 commits) shmem: Fix shmem_rename2() libfs: Add simple_offset_rename() API libfs: Fix simple_offset_rename_exchange() fs/libfs.c | 55 +++++++++++++++++++++++++++++++++++++++++----- include/linux/fs.h | 2 ++ mm/shmem.c | 3 +-- 3 files changed, 52 insertions(+), 8 deletions(-) Signed-off-by: Christian Brauner <[email protected]>
2024-04-17  libfs: Add simple_offset_rename() API  (Chuck Lever, 1 file, +2/-0)
I'm about to fix a tmpfs rename bug that requires the use of internal simple_offset helpers that are not available in mm/shmem.c Signed-off-by: Chuck Lever <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
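For reference, the new API's likely shape (a sketch; the exact prototype as merged may differ):

    /* declared in include/linux/fs.h, implemented in fs/libfs.c */
    int simple_offset_rename(struct inode *old_dir, struct dentry *old_dentry,
                             struct inode *new_dir, struct dentry *new_dentry);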
2024-04-17  sched/vtime: Do not include <asm/vtime.h> header  (Alexander Gordeev, 1 file, +0/-4)
There is no architecture-specific code or data left that generic <linux/vtime.h> needs to know about. Thus, avoid the inclusion of <asm/vtime.h> header. Signed-off-by: Alexander Gordeev <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Acked-by: Nicholas Piggin <[email protected]> Link: https://lore.kernel.org/r/f7cd245668b9ae61a55184871aec494ec9199c4a.1712760275.git.agordeev@linux.ibm.com
2024-04-17  sched/vtime: Remove confusing arch_vtime_task_switch() declaration  (Alexander Gordeev, 1 file, +0/-1)
Callback arch_vtime_task_switch() is only defined when CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is selected. Yet, the function prototype forward declaration is present for CONFIG_VIRT_CPU_ACCOUNTING_GEN variant. Remove it. Signed-off-by: Alexander Gordeev <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Reviewed-by: Nicholas Piggin <[email protected]> Link: https://lore.kernel.org/r/783d7c611864f82b0fb9edf01890b9396f3a549a.1712760275.git.agordeev@linux.ibm.com
2024-04-17  PNP: Add dev_is_pnp() macro  (Guanbing Huang, 1 file, +4/-0)
Add dev_is_pnp() macro to determine whether the device is a PNP device. Signed-off-by: Guanbing Huang <[email protected]> Suggested-by: Andy Shevchenko <[email protected]> Reviewed-by: Bing Fan <[email protected]> Tested-by: Linheng Du <[email protected]> Reviewed-by: Andy Shevchenko <[email protected]> Acked-by: Rafael J. Wysocki <[email protected]> Link: https://lore.kernel.org/r/4e68f5557ad53b671ca8103e572163eca52a8f29.1713234515.git.albanhuang@tencent.com Signed-off-by: Greg Kroah-Hartman <[email protected]>
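A sketch of the expected definition (assuming the existing pnp_bus_type symbol; the !CONFIG_PNP stub is an assumption):

    /* include/linux/pnp.h */
    #if IS_ENABLED(CONFIG_PNP)
    #define dev_is_pnp(dev) ((dev)->bus == &pnp_bus_type)
    #else
    #define dev_is_pnp(dev) false
    #endif

This mirrors the dev_is_pci()/dev_is_platform() pattern: a cheap bus-pointer comparison usable from generic code.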
2024-04-16  net: pse-pd: Rectify and adapt the naming of admin_cotrol member of struct pse_control_config  (Kory Maincent (Dent Project), 1 file, +2/-2)
In commit 18ff0bcda6d1 ("ethtool: add interface to interact with Ethernet Power Equipment"), the 'pse_control_config' structure was introduced, housing a single member labeled 'admin_cotrol' responsible for maintaining the operational state of the PoDL PSE functions. A noticeable typographical error exists in the naming of this field ('cotrol' should be corrected to 'control'), which this commit aims to rectify. Furthermore, with upcoming extensions of this structure to encompass PoE functionalities, the field is being renamed to 'podl_admin_state' to distinctly indicate that this state is tailored specifically for PoDL. Reviewed-by: Oleksij Rempel <[email protected]> Reviewed-by: Andrew Lunn <[email protected]> Signed-off-by: Kory Maincent <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2024-04-17  Add bridged amplifiers to cs42l43  (Mark Brown, 22 files, +165/-42)
Merge series from Charles Keepax <[email protected]>: In some cs42l43 systems a couple of cs35l56 amplifiers are attached to the cs42l43's SPI and I2S. On Windows the cs42l43 is controlled by an SDCA class driver and these two amplifiers are controlled by firmware running on the cs42l43. Under Linux, however, the decision was made to interact with the cs42l43 directly, affording the user greater control over the audio system. This has resulted in an issue where these two bridged cs35l56 amplifiers are not populated in ACPI and must be added manually. There is at least an SDCA extension unit DT entry we can key off. The process of adding this is handled using a software node: first, the ability to add native chip selects to software nodes must be added; second, an additional flag for naming the SPI devices is added, which allows the machine driver to key to the correct amplifier; finally, the cs42l43 SPI driver adds the two amplifiers directly onto its SPI bus. An additional series will follow soon to add the audio machine driver parts (in the sof-sdw driver), however that is fairly orthogonal to this part of the process, getting the actual amplifiers registered.
2024-04-16  mm/shmem: inline shmem_is_huge() for disabled transparent hugepages  (Sumanth Korikkar, 1 file, +9/-0)
In order to minimize code size (CONFIG_CC_OPTIMIZE_FOR_SIZE=y), the compiler might choose to make a regular function call (out-of-line) for shmem_is_huge() instead of inlining it. When transparent hugepages are disabled (CONFIG_TRANSPARENT_HUGEPAGE=n), this can cause a compilation error. mm/shmem.c: In function `shmem_getattr': ./include/linux/huge_mm.h:383:27: note: in expansion of macro `BUILD_BUG' 383 | #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; }) | ^~~~~~~~~ mm/shmem.c:1148:33: note: in expansion of macro `HPAGE_PMD_SIZE' 1148 | stat->blksize = HPAGE_PMD_SIZE; To prevent the possible error, always inline shmem_is_huge() when transparent hugepages are disabled. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sumanth Korikkar <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Ilya Leoshkevich <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
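A sketch of the shape of the fix (prototype details are assumptions; the point is the __always_inline stub for the THP=n case):

    #ifdef CONFIG_TRANSPARENT_HUGEPAGE
    bool shmem_is_huge(struct inode *inode, pgoff_t index, bool shmem_huge_force,
                       struct mm_struct *mm, unsigned long vm_flags);
    #else
    static __always_inline bool shmem_is_huge(struct inode *inode, pgoff_t index,
                                              bool shmem_huge_force,
                                              struct mm_struct *mm,
                                              unsigned long vm_flags)
    {
            /* always inlined, so the HPAGE_PMD_SIZE/BUILD_BUG() branch in
             * callers is provably dead and never emitted out of line */
            return false;
    }
    #endif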
2024-04-16  mm,swapops: update check in is_pfn_swap_entry for hwpoison entries  (Oscar Salvador, 1 file, +33/-32)
Tony reported that the Machine check recovery was broken in v6.9-rc1, as he was hitting a VM_BUG_ON when injecting uncorrectable memory errors to DRAM. After some more digging and debugging on his side, he realized that this went back to v6.1, with the introduction of 'commit 0d206b5d2e0d ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry")'. That commit, among other things, introduced swp_offset_pfn(), replacing hwpoison_entry_to_pfn() in its favour. The patch also introduced a VM_BUG_ON() check for is_pfn_swap_entry(), but is_pfn_swap_entry() never got updated to cover hwpoison entries, which means that we would hit the VM_BUG_ON whenever we would call swp_offset_pfn() for such entries on environments with CONFIG_DEBUG_VM set. Fix this by updating the check to cover hwpoison entries as well, and update the comment while we are at it. Link: https://lkml.kernel.org/r/[email protected] Fixes: 0d206b5d2e0d ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry") Signed-off-by: Oscar Salvador <[email protected]> Reported-by: Tony Luck <[email protected]> Closes: https://lore.kernel.org/all/Zg8kLSl2yAlA3o5D@agluck-desk3/ Tested-by: Tony Luck <[email protected]> Reviewed-by: Peter Xu <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Acked-by: Miaohe Lin <[email protected]> Cc: <[email protected]> [6.1.x] Signed-off-by: Andrew Morton <[email protected]>
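The fix amounts to adding the hwpoison case to the predicate; roughly (a sketch, assuming the existing entry-type helpers):

    static inline bool is_pfn_swap_entry(swp_entry_t entry)
    {
            /* Make sure the swp offset can always store the needed fields */
            BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_PFN_BITS);

            return is_migration_entry(entry) || is_device_private_entry(entry) ||
                   is_device_exclusive_entry(entry) || is_hwpoison_entry(entry);
    }

With hwpoison entries covered, the VM_BUG_ON in swp_offset_pfn() no longer fires for them on CONFIG_DEBUG_VM kernels.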
2024-04-16  cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints  (Jesper Dangaard Brouer, 1 file, +1/-1)
This commit enhances the ability to troubleshoot the global cgroup_rstat_lock by introducing wrapper helper functions for the lock along with associated tracepoints. Although global, the cgroup_rstat_lock helper APIs and tracepoints take arguments such as the cgroup pointer and the cpu_in_loop variable. This adjustment is made because flushing occurs per cgroup despite the lock being global. Hence, when troubleshooting, it's important to identify the relevant cgroup. The cpu_in_loop variable is necessary because the global lock may be released within the main flushing loop that traverses CPUs. In the tracepoints, the cpu_in_loop value is set to -1 when acquiring the main lock; otherwise, it denotes the CPU number processed last. The new feature in this patchset is detecting when the lock is contended. The tracepoints are implemented with production in mind. For minimum overhead attach to cgroup:cgroup_rstat_lock_contended, which only gets activated when trylock detects the lock is contended. A quick production check for issues could be done via this perf command: perf record -g -e cgroup:cgroup_rstat_lock_contended The next natural question is how long lock contenders wait to obtain the lock. This can be answered by measuring the time between cgroup:cgroup_rstat_lock_contended and cgroup:cgroup_rstat_locked when args->contended is set. Like this bpftrace script: bpftrace -e ' tracepoint:cgroup:cgroup_rstat_lock_contended {@start[tid]=nsecs} tracepoint:cgroup:cgroup_rstat_locked { if (args->contended) { @wait_ns=hist(nsecs-@start[tid]); delete(@start[tid]);}} interval:s:1 {time("%H:%M:%S "); print(@wait_ns); }' Extending this with the time spent holding the lock will be more expensive, as this also looks at all the non-contended cases. Like this bpftrace script: bpftrace -e ' tracepoint:cgroup:cgroup_rstat_lock_contended {@start[tid]=nsecs} tracepoint:cgroup:cgroup_rstat_locked { @locked[tid]=nsecs; if (args->contended) { @wait_ns=hist(nsecs-@start[tid]); delete(@start[tid]);}} tracepoint:cgroup:cgroup_rstat_unlock { @locked_ns=hist(nsecs-@locked[tid]); delete(@locked[tid]);} interval:s:1 {time("%H:%M:%S "); print(@wait_ns);print(@locked_ns); }' Signed-off-by: Jesper Dangaard Brouer <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
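A sketch of what the lock-side wrapper looks like (helper and tracepoint names per the commit message; the body is illustrative):

    static __always_inline
    void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop)
            __acquires(&cgroup_rstat_lock)
    {
            bool contended;

            /* trylock first, so cgroup_rstat_lock_contended only fires
             * when the lock is actually contended */
            contended = !spin_trylock_irq(&cgroup_rstat_lock);
            if (contended) {
                    trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended);
                    spin_lock_irq(&cgroup_rstat_lock);
            }
            trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
    }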
2024-04-16  gpio: swnode: Add ability to specify native chip selects for SPI  (Charles Keepax, 1 file, +4/-0)
SPI devices can specify a cs-gpios property to enumerate their chip selects. Under device tree, a zero entry in this property can be used to specify that a particular chip select is using the SPI controller's native chip select, for example: cs-gpios = <&gpio1 0 0>, <0>; Here, the second chip select is native. However, when using swnodes there is currently no way to specify a native chip select. The proposal here is to register a swnode_gpio_undefined software node, which can be specified to indicate a native chip select. For example: static const struct software_node_ref_args device_cs_refs[] = { { .node = &device_gpiochip_swnode, .nargs = 2, .args = { 0, GPIO_ACTIVE_LOW }, }, { .node = &swnode_gpio_undefined, .nargs = 0, }, }; Register the swnode as the gpiolib is initialised, and check in swnode_get_gpio_device() whether the returned node matches swnode_gpio_undefined; if so, return -ENOENT, which matches the behaviour of the device tree system when it encounters a 0 phandle. Reviewed-by: Linus Walleij <[email protected]> Reviewed-by: Andy Shevchenko <[email protected]> Signed-off-by: Charles Keepax <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Mark Brown <[email protected]>
2024-04-16  coresight: Add helpers registering/removing both AMBA and platform drivers  (Anshuman Khandual, 1 file, +7/-0)
This adds two helpers, i.e. coresight_init_driver()/remove_driver(), enabling coresight devices to register or remove both AMBA and platform drivers. This changes the replicator and funnel devices to use the new helpers. Cc: Suzuki K Poulose <[email protected]> Cc: Mike Leach <[email protected]> Cc: James Clark <[email protected]> Cc: Leo Yan <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Reviewed-by: James Clark <[email protected]> Signed-off-by: Anshuman Khandual <[email protected]> Signed-off-by: Suzuki K Poulose <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-15  vfs: export remap and write check helpers  (Darrick J. Wong, 1 file, +1/-0)
Export these functions so that the next patch can use them to check the file ranges being passed to the XFS_IOC_EXCHANGE_RANGE operation. Cc: [email protected] Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
2024-04-15  remove call_{read,write}_iter() functions  (Miklos Szeredi, 1 file, +0/-12)
These have no clear purpose. This is effectively a revert of commit bb7462b6fd64 ("vfs: use helpers for calling f_op->{read,write}_iter()"). The patch was created with the help of a coccinelle script. Fixes: bb7462b6fd64 ("vfs: use helpers for calling f_op->{read,write}_iter()") Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> Signed-off-by: Al Viro <[email protected]>
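For context, the removed helpers were thin wrappers; call sites revert to the direct form (sketch):

    /* removed: */
    static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
                                         struct iov_iter *iter)
    {
            return file->f_op->read_iter(kio, iter);
    }

    /* call sites now do this directly: */
    ret = file->f_op->read_iter(kio, iter);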
2024-04-15  kernel_file_open(): get rid of inode argument  (Al Viro, 1 file, +1/-1)
always equal to ->dentry->d_inode of the path argument these days. Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Al Viro <[email protected]>
2024-04-15  fd_is_open(): move to fs/file.c  (Al Viro, 1 file, +0/-5)
no users outside that... Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Al Viro <[email protected]>
2024-04-15  close_on_exec(): pass files_struct instead of fdtable  (Al Viro, 1 file, +5/-5)
both callers are happier that way... Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Al Viro <[email protected]>
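A sketch of the interface change (assuming the usual fdtable accessors):

    /* before: every caller had to look up the fdtable itself */
    static inline bool close_on_exec(unsigned int fd, const struct fdtable *fdt)
    {
            return test_bit(fd, fdt->close_on_exec);
    }

    /* after: callers pass the files_struct; the lookup moves inside */
    static inline bool close_on_exec(unsigned int fd, const struct files_struct *files)
    {
            return test_bit(fd, files_fdtable(files)->close_on_exec);
    }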
2024-04-15  net: dqs: make struct dql more cache efficient  (Breno Leitao, 1 file, +3/-2)
With the previous change, dql->stall_thrs will be in the hot path (at queue side), even if DQS is disabled. The other fields accessed in this function (last_obj_cnt and num_queued) are in the first cache line, so let's move this field (stall_thrs) to the very first cache line, since there is a hole there. This does not change the structure size, since it moves a short (2 bytes) into a 4-byte hole in the first cache line. This is the new structure format now: struct dql { unsigned int num_queued; unsigned int last_obj_cnt; ... short unsigned int stall_thrs; /* XXX 2 bytes hole, try to pack */ ... /* --- cacheline 1 boundary (64 bytes) --- */ ... /* Longest stall detected, reported to user */ short unsigned int stall_max; /* XXX 2 bytes hole, try to pack */ }; Also, read the stall_thrs (now in the very first cache line) earlier, together with dql->num_queued (also in the first cache line). Suggested-by: Jakub Kicinski <[email protected]> Suggested-by: Eric Dumazet <[email protected]> Signed-off-by: Breno Leitao <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2024-04-15  net: dql: Optimize stall information population  (Breno Leitao, 1 file, +3/-1)
When the Dynamic Queue Limit (DQL) is set, it always populates stall information through dql_queue_stall(). However, this information is only necessary if a stall threshold is set, stored in struct dql->stall_thrs. dql_queue_stall() is cheap, but not free, since it does have memory barriers and so forth. Do not call dql_queue_stall() if there is no stall threshold set, and save some CPU cycles. Signed-off-by: Breno Leitao <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
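The change boils down to a cheap guard in dql_queued() before the stall bookkeeping (sketch; field and helper names per the commit message):

    /* inside dql_queued(): only pay for the stall bookkeeping (and its
     * memory barriers) when a stall threshold is actually configured */
    if (READ_ONCE(dql->stall_thrs))
            dql_queue_stall(dql);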
2024-04-15  net: dql: Separate queue function responsibilities  (Breno Leitao, 1 file, +25/-19)
The dql_queued() function currently handles both queuing object counts and populating bitmaps for reporting stalls. This commit splits the bitmap population into a separate function, allowing for conditional invocation in scenarios where the feature is disabled. This refactor maintains functionality while improving code organization. Signed-off-by: Breno Leitao <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2024-04-15  net: dql: Avoid calling BUG() when WARN() is enough  (Breno Leitao, 1 file, +2/-1)
If the dql_queued() function receives an invalid argument, WARN about it and continue, instead of crashing the kernel. This was raised by checkpatch while I was refactoring this code (see the following patch/commit): WARNING: Do not crash the kernel unless it is absolutely unavoidable--use WARN_ON_ONCE() plus recovery code (if feasible) instead of BUG() or variants Signed-off-by: Breno Leitao <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
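Sketch of the change (illustrative):

    /* before */
    BUG_ON(count > DQL_MAX_OBJECT);

    /* after: warn once and bail out instead of crashing the kernel */
    if (WARN_ON_ONCE(count > DQL_MAX_OBJECT))
            return;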
2024-04-15  Replace macro "ARCH_HAVE_EXTRA_ELF_NOTES" with kconfig  (Vignesh Balasubramanian, 1 file, +1/-1)
"ARCH_HAVE_EXTRA_ELF_NOTES" enables an extra note section in the core dump. Kconfig variable is preferred over ARCH_HAVE_* macro. Co-developed-by: Jini Susan George <[email protected]> Signed-off-by: Jini Susan George <[email protected]> Signed-off-by: Vignesh Balasubramanian <[email protected]> Acked-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]>
2024-04-15  rcu: Mollify sparse with RCU guard  (Johannes Berg, 1 file, +13/-1)
When using "guard(rcu)();" sparse will complain, because even though it now understands the cleanup attribute, it doesn't evaluate the calls from it at function exit, and thus doesn't count the context correctly. Given that there's a conditional in the resulting code: static inline void class_rcu_destructor(class_rcu_t *_T) { if (_T->lock) { rcu_read_unlock(); } } it seems that even if we tried to teach sparse to evaluate the cleanup attribute function, it'd still be difficult to really make it understand the full context here. Suppress the sparse warning by just releasing the context in the acquisition part of the function; after all, we know it's safe with the guard, that's the whole point of it. Signed-off-by: Johannes Berg <[email protected]> Reviewed-by: Boqun Feng <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
2024-04-15  dma-buf: Do not build debugfs related code when !CONFIG_DEBUG_FS  (Tvrtko Ursulin, 1 file, +2/-0)
There is no point in compiling in the list and mutex operations which are only used from the dma-buf debugfs code, if debugfs is not compiled in. Put the code in question behind some kconfig guards and so save some text and maybe even a pointer per object at runtime when not enabled. Signed-off-by: Tvrtko Ursulin <[email protected]> Cc: Sumit Semwal <[email protected]> Cc: Christian König <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Reviewed-by: T.J. Mercier <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Maíra Canal <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2024-04-15  rcu: Remove redundant CONFIG_PROVE_RCU #if condition  (Paul E. McKenney, 1 file, +3/-3)
The #if condition controlling the rcu_preempt_sleep_check() definition has a redundant check for CONFIG_PREEMPT_RCU, which is already checked for by an enclosing #ifndef. This commit therefore removes this redundant condition from the inner #if. Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
2024-04-15  vfs, swap: compile out IS_SWAPFILE() on swapless configs  (Alexey Dobriyan, 1 file, +6/-0)
No swap support -- no swapfiles possible. Signed-off-by: Alexey Dobriyan (Yandex) <[email protected]> Link: https://lore.kernel.org/r/2391c7f5-0f83-4188-ae56-4ec7ccbf2576@p183 Signed-off-by: Christian Brauner <[email protected]>
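A sketch of the likely shape (the CONFIG_SWAP=y side matches the existing definition; the stub form for swapless builds is an assumption):

    /* include/linux/fs.h */
    #ifdef CONFIG_SWAP
    #define IS_SWAPFILE(inode)      ((inode)->i_flags & S_SWAPFILE)
    #else
    #define IS_SWAPFILE(inode)      ((void)(inode), 0)
    #endif

With the stub evaluating to constant 0, the compiler can drop swapfile-only branches entirely on CONFIG_SWAP=n builds.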
2024-04-15  io_uring: separate header for exported net bits  (Pavel Begunkov, 2 files, +18/-6)
We're exporting some io_uring bits to networking, e.g. for implementing a net callback for io_uring cmds, but we don't want to expose more than needed. Add a separate header for networking. Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: David Wei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-04-15  io_uring: remove async request cache  (Pavel Begunkov, 1 file, +0/-4)
io_req_complete_post() was the sole user of ->locked_free_list, but since we just gutted the function, the cache is not used anymore and can be removed. ->locked_free_list served as an asynchronous counterpart of the main request (i.e. struct io_kiocb) cache for all unlocked cases like io-wq. Now they're all forced to be completed into the main cache directly, off of the normal completion path or via io_free_req(). Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/7bffccd213e370abd4de480e739d8b08ab6c1326.1712331455.git.asml.silence@gmail.com Reviewed-by: Ming Lei <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2024-04-15  io_uring/kbuf: use vm_insert_pages() for mmap'ed pbuf ring  (Jens Axboe, 1 file, +0/-3)
Rather than use remap_pfn_range() for this and manually free later, switch to using vm_insert_page() and have it Just Work. This requires a bit of effort on the mmap lookup side, as the ctx uring_lock isn't held, which otherwise protects buffer_lists from being torn down, and it's not safe to grab from mmap context as that would introduce an ABBA deadlock between the mmap lock and the ctx uring_lock. Instead, look up the buffer_list under RCU, as the list is RCU freed already. Use the existing reference count to determine whether it's possible to safely grab a reference to it (eg if it's not zero already), and drop that reference when done with the mapping. If the mmap reference is the last one, the buffer_list and the associated memory can go away, since the vma insertion has references to the inserted pages at that point. Signed-off-by: Jens Axboe <[email protected]>
2024-04-15  io_uring/alloc_cache: switch to array based caching  (Jens Axboe, 1 file, +1/-1)
Currently lists are being used to manage this, but best practice is usually to have these in an array instead, as that is cheaper to manage. Outside of that detail, games are also played with KASAN as the list is inside the cached entry itself. Finally, all users of this need a struct io_cache_entry embedded in their struct, which is union'ized with something else in there that isn't used across the free -> realloc cycle. Get rid of all of that, and simply have it be an array. This will not change the memory used, as we're just trading an 8-byte member entry for the per-elem array size. This reduces the overhead of the recycled allocations, and it reduces the amount of code needed to support recycling to about half of what it currently is. Signed-off-by: Jens Axboe <[email protected]>
2024-04-15  io_uring/uring_cmd: switch to always allocating async data  (Jens Axboe, 1 file, +1/-0)
Basic conversion ensuring async_data is allocated off the prep path. Adds a basic alloc cache as well, as passthrough IO can be quite high in rate. Tested-by: Anuj Gupta <[email protected]> Reviewed-by: Anuj Gupta <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2024-04-15  io_uring/rw: always setup io_async_rw for read/write requests  (Jens Axboe, 1 file, +1/-0)
read/write requests try to put everything on the stack, and then alloc and copy if a retry is needed. This necessitates a bunch of nasty code that deals with intermediate state. Get rid of this, and have the prep side setup everything that is needed upfront, which greatly simplifies the opcode handlers. This includes adding an alloc cache for io_async_rw, to make it cheap to handle. In terms of cost, this should be basically free and transparent. For the worst case of {READ,WRITE}_FIXED which didn't need it before, performance is unaffected in the normal peak workload that is being used to test that. Still runs at 122M IOPS. Signed-off-by: Jens Axboe <[email protected]>
2024-04-15  io_uring: get rid of intermediate aux cqe caches  (Pavel Begunkov, 1 file, +1/-2)
io_post_aux_cqe(), which is used for multishot requests, delays completions by putting CQEs into a temporary array for the purpose of completion lock/flush batching. DEFER_TASKRUN doesn't need any locking, so for it we can put completions directly into the CQ and defer post completion handling with a flag. That leaves !DEFER_TASKRUN, which is not that interesting / hot for multishot requests, so have conditional locking with deferred flush for them. Signed-off-by: Pavel Begunkov <[email protected]> Tested-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/b1d05a81fd27aaa2a07f9860af13059e7ad7a890.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2024-04-15  io_uring: remove struct io_tw_state::locked  (Pavel Begunkov, 1 file, +0/-2)
ctx is always locked for task_work now, so get rid of struct io_tw_state::locked. Note I'm stopping one step before removing io_tw_state altogether, which is now empty, because it still serves the purpose of indicating which function is a tw callback and forcing users not to invoke them carelessly out of a wrong context. The removal can always be done later. Signed-off-by: Pavel Begunkov <[email protected]> Tested-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/e95e1ea116d0bfa54b656076e6a977bc221392a4.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2024-04-15  nvme/io_uring: use helper for polled completions  (Jens Axboe, 1 file, +11/-0)
NVMe is making up issue_flags, which is a no-no in general, and to make matters worse, they are completely the wrong ones. For a pure polled request, which it does check for, we're already inside the ctx->uring_lock when the completions are run off io_do_iopoll(). Hence the correct flag would be '0' rather than IO_URING_F_UNLOCKED. Reviewed-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2024-04-15  io_uring/cmd: document some uring_cmd related helpers  (Pavel Begunkov, 1 file, +13/-0)
Add comments warning users that they're only allowed to pass issue_flags that were given from io_uring. Signed-off-by: Pavel Begunkov <[email protected]> Reviewed-by: Ming Lei <[email protected]> Tested-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/82ff8a45f2c3eb5f3a04a33f0692e5e4a1320455.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2024-04-15  ACPI: platform-profile: add platform_profile_cycle()  (Gergo Koteles, 1 file, +1/-0)
Some laptops have a key to switch platform profiles. Add a platform_profile_cycle() function to cycle between the enabled profiles. Signed-off-by: Gergo Koteles <[email protected]> Acked-by: Rafael J. Wysocki <[email protected]> Link: https://lore.kernel.org/r/5a97deddf72aa5e764d881eb39a7ba35c01a903e.1712597199.git.soyer@irl.hu Reviewed-by: Hans de Goede <[email protected]> Signed-off-by: Hans de Goede <[email protected]>
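Roughly, the helper reads the current profile and advances to the next enabled one, wrapping around (a sketch with locking and sanity checks omitted; the cur_profile/choices bitmap-walk details are assumptions):

    int platform_profile_cycle(void)
    {
            enum platform_profile_option profile = PLATFORM_PROFILE_LAST;
            enum platform_profile_option next;
            int err;

            err = cur_profile->profile_get(cur_profile, &profile);
            if (err)
                    return err;

            /* next enabled profile after the current one, wrapping around */
            next = find_next_bit_wrap(cur_profile->choices,
                                      PLATFORM_PROFILE_LAST, profile + 1);

            return cur_profile->profile_set(cur_profile, next);
    }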
2024-04-15  gpiolib: acpi: Pass con_id instead of property into acpi_dev_gpio_irq_get_by()  (Andy Shevchenko, 1 file, +4/-4)
Pass the con_id instead of property so that callers won't repeat the GPIO suffixes to try. Acked-by: Mika Westerberg <[email protected]> Signed-off-by: Andy Shevchenko <[email protected]>
2024-04-15  iommu: account IOMMU allocated memory  (Pasha Tatashin, 1 file, +1/-1)
In order to be able to limit the amount of memory that is allocated by the IOMMU subsystem, the memory must be accounted. Account IOMMU as part of the secondary pagetables, as was discussed at LPC. The value of SecPageTables now contains memory allocated by IOMMU and KVM. There is a difference between GFP_ACCOUNT and what NR_IOMMU_PAGES shows. GFP_ACCOUNT is set only where it makes sense to charge to user processes, i.e. IOMMU Page Tables, but there is more IOMMU shared data that should not really be charged to a specific process. Signed-off-by: Pasha Tatashin <[email protected]> Acked-by: David Rientjes <[email protected]> Tested-by: Bagas Sanjaya <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]>
2024-04-15  iommu: observability of the IOMMU allocations  (Pasha Tatashin, 1 file, +3/-0)
Add NR_IOMMU_PAGES into node_stat_item to count the number of pages that are allocated by the IOMMU subsystem. The allocations can be viewed per-node via: /sys/devices/system/node/nodeN/vmstat. For example: $ grep iommu /sys/devices/system/node/node*/vmstat /sys/devices/system/node/node0/vmstat:nr_iommu_pages 106025 /sys/devices/system/node/node1/vmstat:nr_iommu_pages 3464 The value is in page-count; therefore, in the above example the iommu allocations amount to ~428M. Signed-off-by: Pasha Tatashin <[email protected]> Acked-by: David Rientjes <[email protected]> Tested-by: Bagas Sanjaya <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]>
2024-04-15  srcu: Make Tiny SRCU explicitly disable preemption  (Paul E. McKenney, 1 file, +2/-0)
Because Tiny SRCU is used only in kernels built with either CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y, there has not been any need for TINY SRCU to explicitly disable preemption. However, the prospect of lazy preemption changes that, and the lazy-preemption patches do result in rcutorture runs finding both too-short grace periods and grace-period hangs for Tiny SRCU. This commit therefore adds the needed preempt_disable() and preempt_enable() calls to Tiny SRCU. Signed-off-by: Paul E. McKenney <[email protected]> Cc: Ankur Arora <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
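A sketch of what this looks like in Tiny SRCU's read-side entry (the counter logic follows srcutiny; the added lines are the preempt_disable()/preempt_enable() pair):

    int __srcu_read_lock(struct srcu_struct *ssp)
    {
            int idx;

            preempt_disable();  /* needed once lazy preemption can preempt here */
            idx = ((READ_ONCE(ssp->srcu_idx) + 1) & 0x2) >> 1;
            WRITE_ONCE(ssp->srcu_lock_nesting[idx],
                       READ_ONCE(ssp->srcu_lock_nesting[idx]) + 1);
            preempt_enable();
            return idx;
    }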
2024-04-15  rcu: Update lockdep while in RCU read-side critical section  (Uladzislau Rezki (Sony), 1 file, +1/-1)
With Ankur's lazy-/auto-preemption patches applied and with a lazy-preemptible kernel in combination with a non-preemptible RCU, lockdep sometimes complains about context switches within RCU read-side critical sections. This is a false positive due to rcu_read_unlock() updating lockdep state too late: __release(RCU); __rcu_read_unlock(); // Context switch here results in lockdep false positive!!! rcu_lock_release(&rcu_lock_map); /* Keep acq info for rls diags. */ Although this complaint could also happen with preemptible RCU in a preemptible kernel, the odds of that happening are quite low. In contrast, with non-preemptible RCU, a long critical section has a high probability of performing a context switch from the preempt_enable() in __rcu_read_unlock(). The fix is straightforward: just move the rcu_lock_release() within rcu_read_unlock() to obtain the reverse order from that of rcu_read_lock(): rcu_lock_release(&rcu_lock_map); /* Keep acq info for rls diags. */ __release(RCU); __rcu_read_unlock(); This commit makes this change. Co-developed-by: Frederic Weisbecker <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Co-developed-by: Joel Fernandes (Google) <[email protected]> Signed-off-by: Joel Fernandes (Google) <[email protected]> Co-developed-by: Boqun Feng <[email protected]> Signed-off-by: Boqun Feng <[email protected]> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Reviewed-by: Paul E. McKenney <[email protected]> Cc: Ankur Arora <[email protected]> Cc: Thomas Gleixner <[email protected]>
2024-04-14  perf: Move perf_event_fasync() to perf_event.h  (Kyle Huey, 1 file, +8/-0)
This will allow it to be called from perf_output_wakeup(). Signed-off-by: Kyle Huey <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-14  Merge branch 'linus' into perf/core, to pick up perf/urgent fixes  (Ingo Molnar, 26 files, +185/-50)
Pick up perf/urgent fixes that are upstream already, but not yet in the perf/core development branch. Signed-off-by: Ingo Molnar <[email protected]>
2024-04-14  Merge tag 'timers-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 1 file, +1/-1)
Pull timer fixes from Ingo Molnar: - Address a (valid) W=1 build warning - Fix timer self-tests - Annotate a KCSAN warning wrt. accesses to the tick_do_timer_cpu global variable - Address a !CONFIG_BUG build warning * tag 'timers-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: selftests: kselftest: Fix build failure with NOLIBC selftests: timers: Fix abs() warning in posix_timers test selftests: kselftest: Mark functions that unconditionally call exit() as __noreturn selftests: timers: Fix posix_timers ksft_print_msg() warning selftests: timers: Fix valid-adjtimex signed left-shift undefined behavior bug: Fix no-return-statement warning with !CONFIG_BUG timekeeping: Use READ/WRITE_ONCE() for tick_do_timer_cpu selftests/timers/posix_timers: Reimplement check_timer_distribution() irqflags: Explicitly ignore lockdep_hrtimer_exit() argument
2024-04-14  Merge tag 'locking-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 2 files, +5/-5)
Pull locking fix from Ingo Molnar: "Fix a PREEMPT_RT build bug" * tag 'locking-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking: Make rwsem_assert_held_write_nolockdep() build with PREEMPT_RT=y
2024-04-14  Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost  (Linus Torvalds, 1 file, +5/-2)
Pull virtio bugfixes from Michael Tsirkin: "Some small, obvious (in hindsight) bugfixes: - new ioctl in vhost-vdpa has a wrong # - not too late to fix - vhost has apparently been lacking an smp_rmb() - due to code duplication :( The duplication will be fixed in the next merge cycle, this is a minimal fix - an error message in vhost talks about guest moving used index - which of course never happens, guest only ever moves the available index - i2c-virtio didn't set the driver owner so it did not get refcounted correctly" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: vhost: correct misleading printing information vhost-vdpa: change ioctl # for VDPA_GET_VRING_SIZE virtio: store owner from modules with register_virtio_driver() vhost: Add smp_rmb() in vhost_enable_notify() vhost: Add smp_rmb() in vhost_vq_avail_empty()
2024-04-14  net: change maximum number of UDP segments to 128  (Yuri Benditovich, 1 file, +1/-1)
The commit fc8b2a619469 ("net: more strict VIRTIO_NET_HDR_GSO_UDP_L4 validation") adds a check of the potential number of UDP segments vs UDP_MAX_SEGMENTS in linux/virtio_net.h. After this change the certification test of USO guest-to-guest transmit on the Windows driver for the virtio-net device fails, for example with a packet size of ~64K and an mss of 536 bytes. In general, USO should not be more restrictive than TSO. Indeed, in case of an unreasonably small mss a lot of segments can cause queue overflow and packet loss on the destination. A limit of 128 segments is good for any practical purpose; with a minimal meaningful mss of 536 the maximal UDP packet will be divided into ~120 segments. The number of segments for UDP packets is validated vs UDP_MAX_SEGMENTS also in udp.c (v4, v6); this does not affect the guest-to-guest path but does affect packets sent to the host, for example. It is important to mention that UDP_MAX_SEGMENTS is a kernel-only define and not available to user-mode socket applications. In order to request an MSS smaller than the MTU, applications just use setsockopt with SOL_UDP and UDP_SEGMENT, and there are no limitations on the socket API level. Fixes: fc8b2a619469 ("net: more strict VIRTIO_NET_HDR_GSO_UDP_L4 validation") Signed-off-by: Yuri Benditovich <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
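The change itself is a one-liner (sketch; the previous value of 64 is an assumption consistent with the commit text):

    /* include/linux/udp.h */
    -#define UDP_MAX_SEGMENTS        (1 << 6UL)      /* 64 */
    +#define UDP_MAX_SEGMENTS        (1 << 7UL)      /* 128 */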
2024-04-14  bootconfig: use memblock_free_late to free xbc memory to buddy  (Qiang Zhang, 1 file, +6/-1)
By the time xbc memory is freed in xbc_exit(), memblock may have handed over memory to the buddy allocator. So it doesn't make sense to free memory back to memblock. memblock_free() called by xbc_exit() even causes UAF bugs on architectures with CONFIG_ARCH_KEEP_MEMBLOCK disabled, like x86. The following KASAN log shows this case. This patch fixes the xbc memory free problem by calling memblock_free() in the early xbc init error rewind path and calling memblock_free_late() in the xbc exit path to free memory to the buddy allocator. [ 9.410890] ================================================================== [ 9.418962] BUG: KASAN: use-after-free in memblock_isolate_range+0x12d/0x260 [ 9.426850] Read of size 8 at addr ffff88845dd30000 by task swapper/0/1 [ 9.435901] CPU: 9 PID: 1 Comm: swapper/0 Tainted: G U 6.9.0-rc3-00208-g586b5dfb51b9 #5 [ 9.446403] Hardware name: Intel Corporation RPLP LP5 (CPU:RaptorLake)/RPLP LP5 (ID:13), BIOS IRPPN02.01.01.00.00.19.015.D-00000000 Dec 28 2023 [ 9.460789] Call Trace: [ 9.463518] <TASK> [ 9.465859] dump_stack_lvl+0x53/0x70 [ 9.469949] print_report+0xce/0x610 [ 9.473944] ? __virt_addr_valid+0xf5/0x1b0 [ 9.478619] ? memblock_isolate_range+0x12d/0x260 [ 9.483877] kasan_report+0xc6/0x100 [ 9.487870] ? memblock_isolate_range+0x12d/0x260 [ 9.493125] memblock_isolate_range+0x12d/0x260 [ 9.498187] memblock_phys_free+0xb4/0x160 [ 9.502762] ? __pfx_memblock_phys_free+0x10/0x10 [ 9.508021] ? mutex_unlock+0x7e/0xd0 [ 9.512111] ? __pfx_mutex_unlock+0x10/0x10 [ 9.516786] ? kernel_init_freeable+0x2d4/0x430 [ 9.521850] ? __pfx_kernel_init+0x10/0x10 [ 9.526426] xbc_exit+0x17/0x70 [ 9.529935] kernel_init+0x38/0x1e0 [ 9.533829] ? _raw_spin_unlock_irq+0xd/0x30 [ 9.538601] ret_from_fork+0x2c/0x50 [ 9.542596] ? __pfx_kernel_init+0x10/0x10 [ 9.547170] ret_from_fork_asm+0x1a/0x30 [ 9.551552] </TASK> [ 9.555649] The buggy address belongs to the physical page: [ 9.561875] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x45dd30 [ 9.570821] flags: 0x200000000000000(node=0|zone=2) [ 9.576271] page_type: 0xffffffff() [ 9.580167] raw: 0200000000000000 ffffea0011774c48 ffffea0012ba1848 0000000000000000 [ 9.588823] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000 [ 9.597476] page dumped because: kasan: bad access detected [ 9.605362] Memory state around the buggy address: [ 9.610714] ffff88845dd2ff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 9.618786] ffff88845dd2ff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 9.626857] >ffff88845dd30000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 9.634930] ^ [ 9.638534] ffff88845dd30080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 9.646605] ffff88845dd30100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 9.654675] ================================================================== Link: https://lore.kernel.org/all/[email protected]/ Fixes: 40caa127f3c7 ("init: bootconfig: Remove all bootconfig data when the init memory is removed") Cc: [email protected] Signed-off-by: Qiang Zhang <[email protected]> Acked-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
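A sketch of the corrected exit path (the xbc_data/xbc_data_size/xbc_nodes names are assumed from lib/bootconfig.c; the exact structure of the merged fix may differ):

    void __init xbc_exit(void)
    {
            /* By now memblock has handed memory to the buddy allocator,
             * so free late instead of calling memblock_free() */
            memblock_free_late(__pa(xbc_data), xbc_data_size);
            xbc_data = NULL;
            xbc_data_size = 0;

            memblock_free_late(__pa(xbc_nodes),
                               sizeof(struct xbc_node) * XBC_NODE_MAX);
            xbc_nodes = NULL;
            xbc_node_num = 0;
    }

The early init error rewind path, which runs before memblock retires, keeps using memblock_free().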