aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2024-05-23mseal: add mseal syscallJeff Xu8-1/+432
The new mseal() is an syscall on 64 bit CPU, and with following signature: int mseal(void addr, size_t len, unsigned long flags) addr/len: memory range. flags: reserved. mseal() blocks following operations for the given memory range. 1> Unmapping, moving to another location, and shrinking the size, via munmap() and mremap(), can leave an empty space, therefore can be replaced with a VMA with a new set of attributes. 2> Moving or expanding a different VMA into the current location, via mremap(). 3> Modifying a VMA via mmap(MAP_FIXED). 4> Size expansion, via mremap(), does not appear to pose any specific risks to sealed VMAs. It is included anyway because the use case is unclear. In any case, users can rely on merging to expand a sealed VMA. 5> mprotect() and pkey_mprotect(). 6> Some destructive madvice() behaviors (e.g. MADV_DONTNEED) for anonymous memory, when users don't have write permission to the memory. Those behaviors can alter region contents by discarding pages, effectively a memset(0) for anonymous memory. Following input during RFC are incooperated into this patch: Jann Horn: raising awareness and providing valuable insights on the destructive madvise operations. Linus Torvalds: assisting in defining system call signature and scope. Liam R. Howlett: perf optimization. Theo de Raadt: sharing the experiences and insight gained from implementing mimmutable() in OpenBSD. Finally, the idea that inspired this patch comes from Stephen Röttger's work in Chrome V8 CFI. [[email protected]: add branch prediction hint, per Pedro] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jeff Xu <[email protected]> Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Pedro Falcato <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Guenter Roeck <[email protected]> Cc: Jann Horn <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Jorge Lucangeli Obes <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Muhammad Usama Anjum <[email protected]> Cc: Pedro Falcato <[email protected]> Cc: Stephen Röttger <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Amer Al Shanawany <[email protected]> Cc: Javier Carrasco <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-05-23mseal: wire up mseal syscallJeff Xu19-2/+23
Patch series "Introduce mseal", v10. This patchset proposes a new mseal() syscall for the Linux kernel. In a nutshell, mseal() protects the VMAs of a given virtual memory range against modifications, such as changes to their permission bits. Modern CPUs support memory permissions, such as the read/write (RW) and no-execute (NX) bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1]. The memory permission feature improves the security stance on memory corruption bugs, as an attacker cannot simply write to arbitrary memory and point the code to it. The memory must be marked with the X bit, or else an exception will occur. Internally, the kernel maintains the memory permissions in a data structure called VMA (vm_area_struct). mseal() additionally protects the VMA itself against modifications of the selected seal type. Memory sealing is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management system. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. Memory sealing can automatically be applied by the runtime loader to seal .text and .rodata pages and applications can additionally seal security critical data at runtime. A similar feature already exists in the XNU kernel with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. Also, Chrome wants to adopt this feature for their CFI work [2] and this patchset has been designed to be compatible with the Chrome use case. Two system calls are involved in sealing the map: mmap() and mseal(). The new mseal() is an syscall on 64 bit CPU, and with following signature: int mseal(void addr, size_t len, unsigned long flags) addr/len: memory range. flags: reserved. mseal() blocks following operations for the given memory range. 1> Unmapping, moving to another location, and shrinking the size, via munmap() and mremap(), can leave an empty space, therefore can be replaced with a VMA with a new set of attributes. 2> Moving or expanding a different VMA into the current location, via mremap(). 3> Modifying a VMA via mmap(MAP_FIXED). 4> Size expansion, via mremap(), does not appear to pose any specific risks to sealed VMAs. It is included anyway because the use case is unclear. In any case, users can rely on merging to expand a sealed VMA. 5> mprotect() and pkey_mprotect(). 6> Some destructive madvice() behaviors (e.g. MADV_DONTNEED) for anonymous memory, when users don't have write permission to the memory. Those behaviors can alter region contents by discarding pages, effectively a memset(0) for anonymous memory. The idea that inspired this patch comes from Stephen Röttger’s work in V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this API. Indeed, the Chrome browser has very specific requirements for sealing, which are distinct from those of most applications. For example, in the case of libc, sealing is only applied to read-only (RO) or read-execute (RX) memory segments (such as .text and .RELRO) to prevent them from becoming writable, the lifetime of those mappings are tied to the lifetime of the process. Chrome wants to seal two large address space reservations that are managed by different allocators. The memory is mapped RW- and RWX respectively but write access to it is restricted using pkeys (or in the future ARM permission overlay extensions). The lifetime of those mappings are not tied to the lifetime of the process, therefore, while the memory is sealed, the allocators still need to free or discard the unused memory. For example, with madvise(DONTNEED). However, always allowing madvise(DONTNEED) on this range poses a security risk. For example if a jump instruction crosses a page boundary and the second page gets discarded, it will overwrite the target bytes with zeros and change the control flow. Checking write-permission before the discard operation allows us to control when the operation is valid. In this case, the madvise will only succeed if the executing thread has PKEY write permissions and PKRU changes are protected in software by control-flow integrity. Although the initial version of this patch series is targeting the Chrome browser as its first user, it became evident during upstream discussions that we would also want to ensure that the patch set eventually is a complete solution for memory sealing and compatible with other use cases. The specific scenario currently in mind is glibc's use case of loading and sealing ELF executables. To this end, Stephen is working on a change to glibc to add sealing support to the dynamic linker, which will seal all non-writable segments at startup. Once this work is completed, all applications will be able to automatically benefit from these new protections. In closing, I would like to formally acknowledge the valuable contributions received during the RFC process, which were instrumental in shaping this patch: Jann Horn: raising awareness and providing valuable insights on the destructive madvise operations. Liam R. Howlett: perf optimization. Linus Torvalds: assisting in defining system call signature and scope. Theo de Raadt: sharing the experiences and insight gained from implementing mimmutable() in OpenBSD. MM perf benchmarks ================== This patch adds a loop in the mprotect/munmap/madvise(DONTNEED) to check the VMAs’ sealing flag, so that no partial update can be made, when any segment within the given memory range is sealed. To measure the performance impact of this loop, two tests are developed. [8] The first is measuring the time taken for a particular system call, by using clock_gettime(CLOCK_MONOTONIC). The second is using PERF_COUNT_HW_REF_CPU_CYCLES (exclude user space). Both tests have similar results. The tests have roughly below sequence: for (i = 0; i < 1000, i++) create 1000 mappings (1 page per VMA) start the sampling for (j = 0; j < 1000, j++) mprotect one mapping stop and save the sample delete 1000 mappings calculates all samples. Below tests are performed on Intel(R) Pentium(R) Gold 7505 @ 2.00GHz, 4G memory, Chromebook. Based on the latest upstream code: The first test (measuring time) syscall__ vmas t t_mseal delta_ns per_vma % munmap__ 1 909 944 35 35 104% munmap__ 2 1398 1502 104 52 107% munmap__ 4 2444 2594 149 37 106% munmap__ 8 4029 4323 293 37 107% munmap__ 16 6647 6935 288 18 104% munmap__ 32 11811 12398 587 18 105% mprotect 1 439 465 26 26 106% mprotect 2 1659 1745 86 43 105% mprotect 4 3747 3889 142 36 104% mprotect 8 6755 6969 215 27 103% mprotect 16 13748 14144 396 25 103% mprotect 32 27827 28969 1142 36 104% madvise_ 1 240 262 22 22 109% madvise_ 2 366 442 76 38 121% madvise_ 4 623 751 128 32 121% madvise_ 8 1110 1324 215 27 119% madvise_ 16 2127 2451 324 20 115% madvise_ 32 4109 4642 534 17 113% The second test (measuring cpu cycle) syscall__ vmas cpu cmseal delta_cpu per_vma % munmap__ 1 1790 1890 100 100 106% munmap__ 2 2819 3033 214 107 108% munmap__ 4 4959 5271 312 78 106% munmap__ 8 8262 8745 483 60 106% munmap__ 16 13099 14116 1017 64 108% munmap__ 32 23221 24785 1565 49 107% mprotect 1 906 967 62 62 107% mprotect 2 3019 3203 184 92 106% mprotect 4 6149 6569 420 105 107% mprotect 8 9978 10524 545 68 105% mprotect 16 20448 21427 979 61 105% mprotect 32 40972 42935 1963 61 105% madvise_ 1 434 497 63 63 115% madvise_ 2 752 899 147 74 120% madvise_ 4 1313 1513 200 50 115% madvise_ 8 2271 2627 356 44 116% madvise_ 16 4312 4883 571 36 113% madvise_ 32 8376 9319 943 29 111% Based on the result, for 6.8 kernel, sealing check adds 20-40 nano seconds, or around 50-100 CPU cycles, per VMA. In addition, I applied the sealing to 5.10 kernel: The first test (measuring time) syscall__ vmas t tmseal delta_ns per_vma % munmap__ 1 357 390 33 33 109% munmap__ 2 442 463 21 11 105% munmap__ 4 614 634 20 5 103% munmap__ 8 1017 1137 120 15 112% munmap__ 16 1889 2153 263 16 114% munmap__ 32 4109 4088 -21 -1 99% mprotect 1 235 227 -7 -7 97% mprotect 2 495 464 -30 -15 94% mprotect 4 741 764 24 6 103% mprotect 8 1434 1437 2 0 100% mprotect 16 2958 2991 33 2 101% mprotect 32 6431 6608 177 6 103% madvise_ 1 191 208 16 16 109% madvise_ 2 300 324 24 12 108% madvise_ 4 450 473 23 6 105% madvise_ 8 753 806 53 7 107% madvise_ 16 1467 1592 125 8 108% madvise_ 32 2795 3405 610 19 122% The second test (measuring cpu cycle) syscall__ nbr_vma cpu cmseal delta_cpu per_vma % munmap__ 1 684 715 31 31 105% munmap__ 2 861 898 38 19 104% munmap__ 4 1183 1235 51 13 104% munmap__ 8 1999 2045 46 6 102% munmap__ 16 3839 3816 -23 -1 99% munmap__ 32 7672 7887 216 7 103% mprotect 1 397 443 46 46 112% mprotect 2 738 788 50 25 107% mprotect 4 1221 1256 35 9 103% mprotect 8 2356 2429 72 9 103% mprotect 16 4961 4935 -26 -2 99% mprotect 32 9882 10172 291 9 103% madvise_ 1 351 380 29 29 108% madvise_ 2 565 615 49 25 109% madvise_ 4 872 933 61 15 107% madvise_ 8 1508 1640 132 16 109% madvise_ 16 3078 3323 245 15 108% madvise_ 32 5893 6704 811 25 114% For 5.10 kernel, sealing check adds 0-15 ns in time, or 10-30 CPU cycles, there is even decrease in some cases. It might be interesting to compare 5.10 and 6.8 kernel The first test (measuring time) syscall__ vmas t_5_10 t_6_8 delta_ns per_vma % munmap__ 1 357 909 552 552 254% munmap__ 2 442 1398 956 478 316% munmap__ 4 614 2444 1830 458 398% munmap__ 8 1017 4029 3012 377 396% munmap__ 16 1889 6647 4758 297 352% munmap__ 32 4109 11811 7702 241 287% mprotect 1 235 439 204 204 187% mprotect 2 495 1659 1164 582 335% mprotect 4 741 3747 3006 752 506% mprotect 8 1434 6755 5320 665 471% mprotect 16 2958 13748 10790 674 465% mprotect 32 6431 27827 21397 669 433% madvise_ 1 191 240 49 49 125% madvise_ 2 300 366 67 33 122% madvise_ 4 450 623 173 43 138% madvise_ 8 753 1110 357 45 147% madvise_ 16 1467 2127 660 41 145% madvise_ 32 2795 4109 1314 41 147% The second test (measuring cpu cycle) syscall__ vmas cpu_5_10 c_6_8 delta_cpu per_vma % munmap__ 1 684 1790 1106 1106 262% munmap__ 2 861 2819 1958 979 327% munmap__ 4 1183 4959 3776 944 419% munmap__ 8 1999 8262 6263 783 413% munmap__ 16 3839 13099 9260 579 341% munmap__ 32 7672 23221 15549 486 303% mprotect 1 397 906 509 509 228% mprotect 2 738 3019 2281 1140 409% mprotect 4 1221 6149 4929 1232 504% mprotect 8 2356 9978 7622 953 423% mprotect 16 4961 20448 15487 968 412% mprotect 32 9882 40972 31091 972 415% madvise_ 1 351 434 82 82 123% madvise_ 2 565 752 186 93 133% madvise_ 4 872 1313 442 110 151% madvise_ 8 1508 2271 763 95 151% madvise_ 16 3078 4312 1234 77 140% madvise_ 32 5893 8376 2483 78 142% From 5.10 to 6.8 munmap: added 250-550 ns in time, or 500-1100 in cpu cycle, per vma. mprotect: added 200-750 ns in time, or 500-1200 in cpu cycle, per vma. madvise: added 33-50 ns in time, or 70-110 in cpu cycle, per vma. In comparison to mseal, which adds 20-40 ns or 50-100 CPU cycles, the increase from 5.10 to 6.8 is significantly larger, approximately ten times greater for munmap and mprotect. When I discuss the mm performance with Brian Makin, an engineer who worked on performance, it was brought to my attention that such performance benchmarks, which measuring millions of mm syscall in a tight loop, may not accurately reflect real-world scenarios, such as that of a database service. Also this is tested using a single HW and ChromeOS, the data from another HW or distribution might be different. It might be best to take this data with a grain of salt. This patch (of 5): Wire up mseal syscall for all architectures. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jeff Xu <[email protected]> Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Guenter Roeck <[email protected]> Cc: Jann Horn <[email protected]> [Bug #2] Cc: Jeff Xu <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Jorge Lucangeli Obes <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Muhammad Usama Anjum <[email protected]> Cc: Pedro Falcato <[email protected]> Cc: Stephen Röttger <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Amer Al Shanawany <[email protected]> Cc: Javier Carrasco <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-05-24uprobes: prevent mutex_lock() under rcu_read_lock()Andrii Nakryiko1-5/+9
Recent changes made uprobe_cpu_buffer preparation lazy, and moved it deeper into __uprobe_trace_func(). This is problematic because __uprobe_trace_func() is called inside rcu_read_lock()/rcu_read_unlock() block, which then calls prepare_uprobe_buffer() -> uprobe_buffer_get() -> mutex_lock(&ucb->mutex), leading to a splat about using mutex under non-sleepable RCU: BUG: sleeping function called from invalid context at kernel/locking/mutex.c:585 in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 98231, name: stress-ng-sigq preempt_count: 0, expected: 0 RCU nest depth: 1, expected: 0 ... Call Trace: <TASK> dump_stack_lvl+0x3d/0xe0 __might_resched+0x24c/0x270 ? prepare_uprobe_buffer+0xd5/0x1d0 __mutex_lock+0x41/0x820 ? ___perf_sw_event+0x206/0x290 ? __perf_event_task_sched_in+0x54/0x660 ? __perf_event_task_sched_in+0x54/0x660 prepare_uprobe_buffer+0xd5/0x1d0 __uprobe_trace_func+0x4a/0x140 uprobe_dispatcher+0x135/0x280 ? uprobe_dispatcher+0x94/0x280 uprobe_notify_resume+0x650/0xec0 ? atomic_notifier_call_chain+0x21/0x110 ? atomic_notifier_call_chain+0xf8/0x110 irqentry_exit_to_user_mode+0xe2/0x1e0 asm_exc_int3+0x35/0x40 RIP: 0033:0x7f7e1d4da390 Code: 33 04 00 0f 1f 80 00 00 00 00 f3 0f 1e fa b9 01 00 00 00 e9 b2 fc ff ff 66 90 f3 0f 1e fa 31 c9 e9 a5 fc ff ff 0f 1f 44 00 00 <cc> 0f 1e fa b8 27 00 00 00 0f 05 c3 0f 1f 40 00 f3 0f 1e fa b8 6e RSP: 002b:00007ffd2abc3608 EFLAGS: 00000246 RAX: 0000000000000000 RBX: 0000000076d325f1 RCX: 0000000000000000 RDX: 0000000076d325f1 RSI: 000000000000000a RDI: 00007ffd2abc3690 RBP: 000000000000000a R08: 00017fb700000000 R09: 00017fb700000000 R10: 00017fb700000000 R11: 0000000000000246 R12: 0000000000017ff2 R13: 00007ffd2abc3610 R14: 0000000000000000 R15: 00007ffd2abc3780 </TASK> Luckily, it's easy to fix by moving prepare_uprobe_buffer() to be called slightly earlier: into uprobe_trace_func() and uretprobe_trace_func(), outside of RCU locked section. This still keeps this buffer preparation lazy and helps avoid the overhead when it's not needed. E.g., if there is only BPF uprobe handler installed on a given uprobe, buffer won't be initialized. Note, the other user of prepare_uprobe_buffer(), __uprobe_perf_func(), is not affected, as it doesn't prepare buffer under RCU read lock. Link: https://lore.kernel.org/all/[email protected]/ Fixes: 1b8f85defbc8 ("uprobes: prepare uprobe args buffer lazily") Reported-by: Breno Leitao <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
2024-05-23drm/amd/display: Enable colorspace property for MST connectorsMario Limonciello1-0/+3
MST colorspace property support was disabled due to a series of warnings that came up when the device was plugged in since the properties weren't made at device creation. Create the properties in advance instead. Suggested-by: Ville Syrjälä <[email protected]> Fixes: 69a959610229 ("drm/amd/display: Temporary Disable MST DP Colorspace Property"). Reported-and-tested-by: Tyler Schneider <[email protected]> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3353 Reviewed-by: Harry Wentland <[email protected]> Signed-off-by: Mario Limonciello <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2024-05-23drm/amdgpu: correct hbm field in boot statusHawking Zhang1-1/+1
hbm filed takes bit 13 and bit 14 in boot status. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2024-05-23Merge tag 'nfs-for-6.10-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds16-68/+129
Pull NFS client updates from Trond Myklebust: "Stable fixes: - nfs: fix undefined behavior in nfs_block_bits() - NFSv4.2: Fix READ_PLUS when server doesn't support OP_READ_PLUS Bugfixes: - Fix mixing of the lock/nolock and local_lock mount options - NFSv4: Fixup smatch warning for ambiguous return - NFSv3: Fix remount when using the legacy binary mount api - SUNRPC: Fix the handling of expired RPCSEC_GSS contexts - SUNRPC: fix the NFSACL RPC retries when soft mounts are enabled - rpcrdma: fix handling for RDMA_CM_EVENT_DEVICE_REMOVAL Features and cleanups: - NFSv3: Use the atomic_open API to fix open(O_CREAT|O_TRUNC) - pNFS/filelayout: S layout segment range in LAYOUTGET - pNFS: rework pnfs_generic_pg_check_layout to check IO range - NFSv2: Turn off enabling of NFS v2 by default" * tag 'nfs-for-6.10-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: nfs: fix undefined behavior in nfs_block_bits() pNFS: rework pnfs_generic_pg_check_layout to check IO range pNFS/filelayout: check layout segment range pNFS/filelayout: fixup pNfs allocation modes rpcrdma: fix handling for RDMA_CM_EVENT_DEVICE_REMOVAL NFS: Don't enable NFS v2 by default NFS: Fix READ_PLUS when server doesn't support OP_READ_PLUS sunrpc: fix NFSACL RPC retry on soft mount SUNRPC: fix handling expired GSS context nfs: keep server info for remounts NFSv4: Fixup smatch warning for ambiguous return NFS: make sure lock/nolock overriding local_lock mount option NFS: add atomic_open for NFSv3 to handle O_TRUNC correctly. pNFS/filelayout: Specify the layout segment range in LAYOUTGET pNFS/filelayout: Remove the whole file layout requirement
2024-05-23Merge tag 'block-6.10-20240523' of git://git.kernel.dk/linuxLinus Torvalds26-186/+314
Pull more block updates from Jens Axboe: "Followup block updates, mostly due to NVMe being a bit late to the party. But nothing major in there, so not a big deal. In detail, this contains: - NVMe pull request via Keith: - Fabrics connection retries (Daniel, Hannes) - Fabrics logging enhancements (Tokunori) - RDMA delete optimization (Sagi) - ublk DMA alignment fix (me) - null_blk sparse warning fixes (Bart) - Discard support for brd (Keith) - blk-cgroup list corruption fixes (Ming) - blk-cgroup stat propagation fix (Waiman) - Regression fix for plugging stall with md (Yu) - Misc fixes or cleanups (David, Jeff, Justin)" * tag 'block-6.10-20240523' of git://git.kernel.dk/linux: (24 commits) null_blk: fix null-ptr-dereference while configuring 'power' and 'submit_queues' blk-throttle: remove unused struct 'avg_latency_bucket' block: fix lost bio for plug enabled bio based device block: t10-pi: add MODULE_DESCRIPTION() blk-mq: add helper for checking if one CPU is mapped to specified hctx blk-cgroup: Properly propagate the iostat update up the hierarchy blk-cgroup: fix list corruption from reorder of WRITE ->lqueued blk-cgroup: fix list corruption from resetting io stat cdrom: rearrange last_media_change check to avoid unintentional overflow nbd: Fix signal handling nbd: Remove a local variable from nbd_send_cmd() nbd: Improve the documentation of the locking assumptions nbd: Remove superfluous casts nbd: Use NULL to represent a pointer brd: implement discard support null_blk: Fix two sparse warnings ublk_drv: set DMA alignment mask to 3 nvme-rdma, nvme-tcp: include max reconnects for reconnect logging nvmet-rdma: Avoid o(n^2) loop in delete_ctrl nvme: do not retry authentication failures ...
2024-05-23nvmet: fix ns enable/disable possible hangSagi Grimberg1-0/+8
When disabling an nvmet namespace, there is a period where the subsys->lock is released, as the ns disable waits for backend IO to complete, and the ns percpu ref to be properly killed. The original intent was to avoid taking the subsystem lock for a prolong period as other processes may need to acquire it (for example new incoming connections). However, it opens up a window where another process may come in and enable the ns, (re)intiailizing the ns percpu_ref, causing the disable sequence to hang. Solve this by taking the global nvmet_config_sem over the entire configfs enable/disable sequence. Fixes: a07b4970f464 ("nvmet: add a generic NVMe target") Signed-off-by: Sagi Grimberg <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-05-23nvme-multipath: fix io accounting on failoverKeith Busch3-2/+4
There are io stats accounting that needs to be handled, so don't call blk_mq_end_request() directly. Use the existing nvme_end_req() helper that already handles everything. Fixes: d4d957b53d91ee ("nvme-multipath: support io stats on the mpath device") Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-05-23nvme: fix multipath batched completion accountingKeith Busch1-5/+10
Batched completions were missing the io stats accounting and bio trace events. Move the common code to a helper and call it from the batched and non-batched functions. Fixes: d4d957b53d91ee ("nvme-multipath: support io stats on the mpath device") Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-05-23Merge tag 'io_uring-6.10-20240523' of git://git.kernel.dk/linuxLinus Torvalds2-6/+6
Pull io_uring fixes from Jens Axboe: "Single fix here for a regression in 6.9, and then a simple cleanup removing some dead code" * tag 'io_uring-6.10-20240523' of git://git.kernel.dk/linux: io_uring: remove checks for NULL 'sq_offset' io_uring/sqpoll: ensure that normal task_work is also run timely
2024-05-23Merge tag 'regulator-fix-v6.10-merge-window' of ↵Linus Torvalds6-77/+48
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator Pull regulator fixes from Mark Brown: "A bunch of fixes that came in during the merge window. Matti found several issues with some of the more complexly configured Rohm regulators and the helpers they use and there were some errors in the specification of tps6594 when regulators are grouped together" * tag 'regulator-fix-v6.10-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: regulator: tps6594-regulator: Correct multi-phase configuration regulator: tps6287x: Force writing VSEL bit regulator: pickable ranges: don't always cache vsel regulator: rohm-regulator: warn if unsupported voltage is set regulator: bd71828: Don't overwrite runtime voltages
2024-05-23Merge tag 'regmap-fix-v6.10-merge-window' of ↵Linus Torvalds1-1/+8
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap Pull regmap fix from Mark Brown: "Guenter ran with memory sanitisers and found an issue in the new KUnit tests that Richard added where an assumption in older test code was exposed, this was fixed quickly by Richard" * tag 'regmap-fix-v6.10-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap: regmap: kunit: Fix array overflow in stride() test
2024-05-23genirq/cpuhotplug, x86/vector: Prevent vector leak during CPU offlineDongli Zhang2-11/+14
The absence of IRQD_MOVE_PCNTXT prevents immediate effectiveness of interrupt affinity reconfiguration via procfs. Instead, the change is deferred until the next instance of the interrupt being triggered on the original CPU. When the interrupt next triggers on the original CPU, the new affinity is enforced within __irq_move_irq(). A vector is allocated from the new CPU, but the old vector on the original CPU remains and is not immediately reclaimed. Instead, apicd->move_in_progress is flagged, and the reclaiming process is delayed until the next trigger of the interrupt on the new CPU. Upon the subsequent triggering of the interrupt on the new CPU, irq_complete_move() adds a task to the old CPU's vector_cleanup list if it remains online. Subsequently, the timer on the old CPU iterates over its vector_cleanup list, reclaiming old vectors. However, a rare scenario arises if the old CPU is outgoing before the interrupt triggers again on the new CPU. In that case irq_force_complete_move() is not invoked on the outgoing CPU to reclaim the old apicd->prev_vector because the interrupt isn't currently affine to the outgoing CPU, and irq_needs_fixup() returns false. Even though __vector_schedule_cleanup() is later called on the new CPU, it doesn't reclaim apicd->prev_vector; instead, it simply resets both apicd->move_in_progress and apicd->prev_vector to 0. As a result, the vector remains unreclaimed in vector_matrix, leading to a CPU vector leak. To address this issue, move the invocation of irq_force_complete_move() before the irq_needs_fixup() call to reclaim apicd->prev_vector, if the interrupt is currently or used to be affine to the outgoing CPU. Additionally, reclaim the vector in __vector_schedule_cleanup() as well, following a warning message, although theoretically it should never see apicd->move_in_progress with apicd->prev_cpu pointing to an offline CPU. Fixes: f0383c24b485 ("genirq/cpuhotplug: Add support for cleaning up move in progress") Signed-off-by: Dongli Zhang <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected]
2024-05-23Merge tag 'net-6.10-rc1' of ↵Linus Torvalds26-245/+243
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Quite smaller than usual. Notably it includes the fix for the unix regression from the past weeks. The TCP window fix will require some follow-up, already queued. Current release - regressions: - af_unix: fix garbage collection of embryos Previous releases - regressions: - af_unix: fix race between GC and receive path - ipv6: sr: fix missing sk_buff release in seg6_input_core - tcp: remove 64 KByte limit for initial tp->rcv_wnd value - eth: r8169: fix rx hangup - eth: lan966x: remove ptp traps in case the ptp is not enabled - eth: ixgbe: fix link breakage vs cisco switches - eth: ice: prevent ethtool from corrupting the channels Previous releases - always broken: - openvswitch: set the skbuff pkt_type for proper pmtud support - tcp: Fix shift-out-of-bounds in dctcp_update_alpha() Misc: - a bunch of selftests stabilization patches" * tag 'net-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (25 commits) r8169: Fix possible ring buffer corruption on fragmented Tx packets. idpf: Interpret .set_channels() input differently ice: Interpret .set_channels() input differently nfc: nci: Fix handling of zero-length payload packets in nci_rx_work() net: relax socket state check at accept time. tcp: remove 64 KByte limit for initial tp->rcv_wnd value net: ti: icssg_prueth: Fix NULL pointer dereference in prueth_probe() tls: fix missing memory barrier in tls_init net: fec: avoid lock evasion when reading pps_enable Revert "ixgbe: Manual AN-37 for troublesome link partners for X550 SFI" testing: net-drv: use stats64 for testing net: mana: Fix the extra HZ in mana_hwc_send_request net: lan966x: Remove ptp traps in case the ptp is not enabled. openvswitch: Set the skbuff pkt_type for proper pmtud support. selftest: af_unix: Make SCM_RIGHTS into OOB data. af_unix: Fix garbage collection of embryos carrying OOB with SCM_RIGHTS tcp: Fix shift-out-of-bounds in dctcp_update_alpha(). selftests/net: use tc rule to filter the na packet ipv6: sr: fix memleak in seg6_hmac_init_algo af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock. ...
2024-05-23Merge tag 'trace-fixes-v6.10' of ↵Linus Torvalds3-13/+15
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: "Minor last minute fixes: - Fix a very tight race between the ring buffer readers and resizing the ring buffer - Correct some stale comments in the ring buffer code - Fix kernel-doc in the rv code - Add a MODULE_DESCRIPTION to preemptirq_delay_test" * tag 'trace-fixes-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: rv: Update rv_en(dis)able_monitor doc to match kernel-doc tracing: Add MODULE_DESCRIPTION() to preemptirq_delay_test ring-buffer: Fix a race between readers and resize checks ring-buffer: Correct stale comments related to non-consuming readers
2024-05-23Merge tag 'trace-tools-v6.10-2' of ↵Linus Torvalds1-4/+4
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing tool fix from Steven Rostedt: "Fix printf format warnings in latency-collector. Use the printf format string with %s to take a string instead of taking in a string directly" * tag 'trace-tools-v6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tools/latency-collector: Fix -Wformat-security compile warns
2024-05-23Merge tag 'trace-assign-str-v6.10' of ↵Linus Torvalds147-808/+794
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing cleanup from Steven Rostedt: "Remove second argument of __assign_str() The __assign_str() macro logic of the TRACE_EVENT() macro was optimized so that it no longer needs the second argument. The __assign_str() is always matched with __string() field that takes a field name and the source for that field: __string(field, source) The TRACE_EVENT() macro logic will save off the source value and then use that value to copy into the ring buffer via the __assign_str(). Before commit c1fa617caeb0 ("tracing: Rework __assign_str() and __string() to not duplicate getting the string"), the __assign_str() needed the second argument which would perform the same logic as the __string() source parameter did. Not only would this add overhead, but it was error prone as if the __assign_str() source produced something different, it may not have allocated enough for the string in the ring buffer (as the __string() source was used to determine how much to allocate) Now that the __assign_str() just uses the same string that was used in __string() it no longer needs the source parameter. It can now be removed" * tag 'trace-assign-str-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/treewide: Remove second parameter of __assign_str()
2024-05-23Merge tag 'sparc-for-6.10-tag1' of ↵Linus Torvalds24-114/+76
git://git.kernel.org/pub/scm/linux/kernel/git/alarsson/linux-sparc Pull sparc updates from Andreas Larsson: - Avoid on-stack cpumask variables in a number of places - Move struct termio to asm/termios.h, matching other architectures and allowing certain user space applications to build also for sparc - Fix missing prototype warnings for sparc64 - Fix version generation warnings for sparc32 - Fix bug where non-consecutive CPU IDs lead to some CPUs not starting - Simplification using swap and cleanup using NULL for pointer - Convert sparc parport and chmc drivers to use remove callbacks returning void * tag 'sparc-for-6.10-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/alarsson/linux-sparc: sparc/leon: Remove on-stack cpumask var sparc/pci_msi: Remove on-stack cpumask var sparc/of: Remove on-stack cpumask var sparc/irq: Remove on-stack cpumask var sparc/srmmu: Remove on-stack cpumask var sparc: chmc: Convert to platform remove callback returning void sparc: parport: Convert to platform remove callback returning void sparc: Compare pointers to NULL instead of 0 sparc: Use swap() to fix Coccinelle warning sparc32: Fix version generation failed warnings sparc64: Fix number of online CPUs sparc64: Fix prototype warning for sched_clock sparc64: Fix prototype warnings in adi_64.c sparc64: Fix prototype warning for dma_4v_iotsb_bind sparc64: Fix prototype warning for uprobe_trap sparc64: Fix prototype warning for alloc_irqstack_bootmem sparc64: Fix prototype warning for vmemmap_free sparc64: Fix prototype warnings in traps_64.c sparc64: Fix prototype warning for init_vdso_image sparc: move struct termio to asm/termios.h
2024-05-23Merge tag 'arm64-fixes' of ↵Linus Torvalds12-25/+125
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fixes from Will Deacon: "The major fix here is for a filesystem corruption issue reported on Apple M1 as a result of buggy management of the floating point register state introduced in 6.8. I initially reverted one of the offending patches, but in the end Ard cooked a proper fix so there's a revert+reapply in the series. Aside from that, we've got some CPU errata workarounds and misc other fixes. - Fix broken FP register state tracking which resulted in filesystem corruption when dm-crypt is used - Workarounds for Arm CPU errata affecting the SSBS Spectre mitigation - Fix lockdep assertion in DMC620 memory controller PMU driver - Fix alignment of BUG table when CONFIG_DEBUG_BUGVERBOSE is disabled" * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64/fpsimd: Avoid erroneous elide of user state reload Reapply "arm64: fpsimd: Implement lazy restore for kernel mode FPSIMD" arm64: asm-bug: Add .align 2 to the end of __BUG_ENTRY perf/arm-dmc620: Fix lockdep assert in ->event_init() Revert "arm64: fpsimd: Implement lazy restore for kernel mode FPSIMD" arm64: errata: Add workaround for Arm errata 3194386 and 3312417 arm64: cputype: Add Neoverse-V3 definitions arm64: cputype: Add Cortex-X4 definitions arm64: barrier: Restore spec_bar() macro
2024-05-23Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds40-175/+353
Pull virtio updates from Michael Tsirkin: "Several new features here: - virtio-net is finally supported in vduse - virtio (balloon and mem) interaction with suspend is improved - vhost-scsi now handles signals better/faster And fixes, cleanups all over the place" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (48 commits) virtio-pci: Check if is_avq is NULL virtio: delete vq in vp_find_vqs_msix() when request_irq() fails MAINTAINERS: add Eugenio Pérez as reviewer vhost-vdpa: Remove usage of the deprecated ida_simple_xx() API vp_vdpa: don't allocate unused msix vectors sound: virtio: drop owner assignment fuse: virtio: drop owner assignment scsi: virtio: drop owner assignment rpmsg: virtio: drop owner assignment nvdimm: virtio_pmem: drop owner assignment wifi: mac80211_hwsim: drop owner assignment vsock/virtio: drop owner assignment net: 9p: virtio: drop owner assignment net: virtio: drop owner assignment net: caif: virtio: drop owner assignment misc: nsm: drop owner assignment iommu: virtio: drop owner assignment drm/virtio: drop owner assignment gpio: virtio: drop owner assignment firmware: arm_scmi: virtio: drop owner assignment ...
2024-05-23irqchip/riscv-imsic: Fixup riscv_ipi_set_virq_range() conflictPalmer Dabbelt1-1/+1
There was a semantic conflict between 21a8f8a0eb35 ("irqchip: Add RISC-V incoming MSI controller early driver") and dc892fb44322 ("riscv: Use IPIs for remote cache/TLB flushes by default") due to an API change. This manifests as a build failure post-merge. Fixes: 0bfbc914d943 ("Merge tag 'riscv-for-linus-6.10-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux") Reported-by: Tomasz Jeznach <[email protected]> Signed-off-by: Palmer Dabbelt <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/all/mhng-10b71228-cf3e-42ca-9abf-5464b15093f1@palmer-ri-x1c9/
2024-05-23riscv: Fix early ftrace nop patchingAlexandre Ghiti2-0/+9
Commit c97bf629963e ("riscv: Fix text patching when IPI are used") converted ftrace_make_nop() to use patch_insn_write() which does not emit any icache flush relying entirely on __ftrace_modify_code() to do that. But we missed that ftrace_make_nop() was called very early directly when converting mcount calls into nops (actually on riscv it converts 2B nops emitted by the compiler into 4B nops). This caused crashes on multiple HW as reported by Conor and Björn since the booting core could have half-patched instructions in its icache which would trigger an illegal instruction trap: fix this by emitting a local flush icache when early patching nops. Fixes: c97bf629963e ("riscv: Fix text patching when IPI are used") Signed-off-by: Alexandre Ghiti <[email protected]> Reported-by: Conor Dooley <[email protected]> Tested-by: Conor Dooley <[email protected]> Reviewed-by: Björn Töpel <[email protected]> Tested-by: Björn Töpel <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Palmer Dabbelt <[email protected]>
2024-05-23Merge tag 'drm-misc-fixes-2024-05-16' of ↵Daniel Vetter11-85/+112
https://gitlab.freedesktop.org/drm/misc/kernel into drm-next Short summary of fixes pull: nouveau: - use tile_mode and pte_kind for VM_BIND bo allocations Signed-off-by: Daniel Vetter <[email protected]> From: Thomas Zimmermann <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2024-05-23tools/latency-collector: Fix -Wformat-security compile warnsShuah Khan1-4/+4
Fix the following -Wformat-security compile warnings adding missing format arguments: latency-collector.c: In function ‘show_available’: latency-collector.c:938:17: warning: format not a string literal and no format arguments [-Wformat-security] 938 | warnx(no_tracer_msg); | ^~~~~ latency-collector.c:943:17: warning: format not a string literal and no format arguments [-Wformat-security] 943 | warnx(no_latency_tr_msg); | ^~~~~ latency-collector.c: In function ‘find_default_tracer’: latency-collector.c:986:25: warning: format not a string literal and no format arguments [-Wformat-security] 986 | errx(EXIT_FAILURE, no_tracer_msg); | ^~~~ latency-collector.c: In function ‘scan_arguments’: latency-collector.c:1881:33: warning: format not a string literal and no format arguments [-Wformat-security] 1881 | errx(EXIT_FAILURE, no_tracer_msg); | ^~~~ Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: [email protected] Fixes: e23db805da2df ("tracing/tools: Add the latency-collector to tools directory") Signed-off-by: Shuah Khan <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-05-23soi: Don't call DMA sync API when not neededMark Brown63-933/+2241
Merge series from Andy Shevchenko <[email protected]>: A couple of fixes to avoid calling DMA sync API when it's not needed. This doesn't stop from discussing if IOMMU code is doing the right thing, i.e. dereferences SG list when orig_nents == 0, but this is a separate story.
2024-05-23r8169: Fix possible ring buffer corruption on fragmented Tx packets.Ken Milmore1-1/+2
An issue was found on the RTL8125b when transmitting small fragmented packets, whereby invalid entries were inserted into the transmit ring buffer, subsequently leading to calls to dma_unmap_single() with a null address. This was caused by rtl8169_start_xmit() not noticing changes to nr_frags which may occur when small packets are padded (to work around hardware quirks) in rtl8169_tso_csum_v2(). To fix this, postpone inspecting nr_frags until after any padding has been applied. Fixes: 9020845fb5d6 ("r8169: improve rtl8169_start_xmit") Cc: [email protected] Signed-off-by: Ken Milmore <[email protected]> Reviewed-by: Heiner Kallweit <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
2024-05-23eventfs: Do not use attributes for events directorySteven Rostedt (Google)1-7/+7
The top "events" directory has a static inode (it's created when it is and removed when the directory is removed). There's no need to use the events ei->attr to determine its permissions. But it is used for saving the permissions of the "events" directory for when it is created, as that is needed for the default permissions for the files and directories underneath it. For example: # cd /sys/kernel/tracing # mkdir instances/foo # chown 1001 instances/foo/events The files under instances/foo/events should still have the same owner as instances/foo (which the instances/foo/events ei->attr will hold), but the events directory now has owner 1001. Link: https://lore.kernel.org/lkml/[email protected] Cc: Linus Torvalds <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-05-23eventfs: Cleanup permissions in creation of inodesSteven Rostedt (Google)1-67/+23
The permissions being set during the creation of the inodes was updating eventfs_inode attributes as well. Those attributes should only be touched by the setattr or remount operations, not during the creation of inodes. The eventfs_inode attributes should only be used to set the inodes and should not be modified during the inode creation. Simplify the code and fix the situation by: 1) Removing the eventfs_find_events() and doing a simple lookup for the events descriptor in eventfs_get_inode() 2) Remove update_events_attr() as the attributes should only be used to update the inode and should not be modified here. 3) Add update_inode_attr() that uses the attributes to determine what the inode permissions should be. 4) As the parent_inode of the eventfs_root_inode structure is no longer needed, remove it. Now on creation, the inode gets the proper permissions without causing side effects to the ei->attr field. Link: https://lore.kernel.org/lkml/[email protected] Cc: Linus Torvalds <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-05-23eventfs: Remove getattr and permission callbacksSteven Rostedt (Google)1-40/+0
Now that inodes have their permissions updated on remount, the only other places to update the inode permissions are when they are created and in the setattr callback. The getattr and permission callbacks are not needed as the inodes should already be set at their proper settings. Remove the callbacks, as it not only simplifies the code, but also allows more flexibility to fix the inconsistencies with various corner cases (like changing the permission of an instance directory). Link: https://lore.kernel.org/lkml/[email protected] Cc: Linus Torvalds <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-05-23eventfs: Consolidate the eventfs_inode update in eventfs_get_inode()Steven Rostedt (Google)1-19/+23
To simplify the code, create a eventfs_get_inode() that is used when an eventfs file or directory is created. Have the internal tracefs_inode updated the appropriate flags in this function and update the inode's mode as well. Link: https://lore.kernel.org/lkml/[email protected] Cc: Linus Torvalds <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-05-23tracefs: Clear EVENT_INODE flag in tracefs_drop_inode()Steven Rostedt (Google)1-16/+17
When the inode is being dropped from the dentry, the TRACEFS_EVENT_INODE flag needs to be cleared to prevent a remount from calling eventfs_remount() on the tracefs_inode private data. There's a race between the inode is dropped (and the dentry freed) to where the inode is actually freed. If a remount happens between the two, the eventfs_inode could be accessed after it is freed (only the dentry keeps a ref count on it). Currently the TRACEFS_EVENT_INODE flag is cleared from the dentry iput() function. But this is incorrect, as it is possible that the inode has another reference to it. The flag should only be cleared when the inode is really being dropped and has no more references. That happens in the drop_inode callback of the inode, as that gets called when the last reference of the inode is released. Remove the tracefs_d_iput() function and move its logic to the more appropriate tracefs_drop_inode() callback function. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: [email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Masahiro Yamada <[email protected]> Fixes: baa23a8d4360d ("tracefs: Reset permissions on remount if permissions are options") Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-05-23eventfs: Update all the eventfs_inodes from the events descriptorSteven Rostedt (Google)1-13/+31
The change to update the permissions of the eventfs_inode had the misconception that using the tracefs_inode would find all the eventfs_inodes that have been updated and reset them on remount. The problem with this approach is that the eventfs_inodes are freed when they are no longer used (basically the reason the eventfs system exists). When they are freed, the updated eventfs_inodes are not reset on a remount because their tracefs_inodes have been freed. Instead, since the events directory eventfs_inode always has a tracefs_inode pointing to it (it is not freed when finished), and the events directory has a link to all its children, have the eventfs_remount() function only operate on the events eventfs_inode and have it descend into its children updating their uid and gids. Link: https://lore.kernel.org/all/CAK7LNARXgaWw3kH9JgrnH4vK6fr8LDkNKf3wq8NhMWJrVwJyVQ@mail.gmail.com/ Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: [email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Fixes: baa23a8d4360d ("tracefs: Reset permissions on remount if permissions are options") Reported-by: Masahiro Yamada <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-05-23tracefs: Update inode permissions on remountSteven Rostedt (Google)2-7/+25
When a remount happens, if a gid or uid is specified update the inodes to have the same gid and uid. This will allow the simplification of the permissions logic for the dynamically created files and directories. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: [email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Masahiro Yamada <[email protected]> Fixes: baa23a8d4360d ("tracefs: Reset permissions on remount if permissions are options") Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-05-23eventfs: Keep the directories from having the same inode number as filesSteven Rostedt (Google)1-1/+5
The directories require unique inode numbers but all the eventfs files have the same inode number. Prevent the directories from having the same inode numbers as the files as that can confuse some tooling. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: [email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Masahiro Yamada <[email protected]> Fixes: 834bf76add3e6 ("eventfs: Save directory inodes in the eventfs_inode structure") Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-05-23dma-mapping: benchmark: handle NUMA_NO_NODE correctlyFedor Pchelkin1-2/+1
cpumask_of_node() can be called for NUMA_NO_NODE inside do_map_benchmark() resulting in the following sanitizer report: UBSAN: array-index-out-of-bounds in ./arch/x86/include/asm/topology.h:72:28 index -1 is out of range for type 'cpumask [64][1]' CPU: 1 PID: 990 Comm: dma_map_benchma Not tainted 6.9.0-rc6 #29 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) Call Trace: <TASK> dump_stack_lvl (lib/dump_stack.c:117) ubsan_epilogue (lib/ubsan.c:232) __ubsan_handle_out_of_bounds (lib/ubsan.c:429) cpumask_of_node (arch/x86/include/asm/topology.h:72) [inline] do_map_benchmark (kernel/dma/map_benchmark.c:104) map_benchmark_ioctl (kernel/dma/map_benchmark.c:246) full_proxy_unlocked_ioctl (fs/debugfs/file.c:333) __x64_sys_ioctl (fs/ioctl.c:890) do_syscall_64 (arch/x86/entry/common.c:83) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) Use cpumask_of_node() in place when binding a kernel thread to a cpuset of a particular node. Note that the provided node id is checked inside map_benchmark_ioctl(). It's just a NUMA_NO_NODE case which is not handled properly later. Found by Linux Verification Center (linuxtesting.org). Fixes: 65789daa8087 ("dma-mapping: add benchmark support for streaming DMA APIs") Signed-off-by: Fedor Pchelkin <[email protected]> Acked-by: Barry Song <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2024-05-23dma-mapping: benchmark: fix node id validationFedor Pchelkin1-1/+2
While validating node ids in map_benchmark_ioctl(), node_possible() may be provided with invalid argument outside of [0,MAX_NUMNODES-1] range leading to: BUG: KASAN: wild-memory-access in map_benchmark_ioctl (kernel/dma/map_benchmark.c:214) Read of size 8 at addr 1fffffff8ccb6398 by task dma_map_benchma/971 CPU: 7 PID: 971 Comm: dma_map_benchma Not tainted 6.9.0-rc6 #37 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) Call Trace: <TASK> dump_stack_lvl (lib/dump_stack.c:117) kasan_report (mm/kasan/report.c:603) kasan_check_range (mm/kasan/generic.c:189) variable_test_bit (arch/x86/include/asm/bitops.h:227) [inline] arch_test_bit (arch/x86/include/asm/bitops.h:239) [inline] _test_bit at (include/asm-generic/bitops/instrumented-non-atomic.h:142) [inline] node_state (include/linux/nodemask.h:423) [inline] map_benchmark_ioctl (kernel/dma/map_benchmark.c:214) full_proxy_unlocked_ioctl (fs/debugfs/file.c:333) __x64_sys_ioctl (fs/ioctl.c:890) do_syscall_64 (arch/x86/entry/common.c:83) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) Compare node ids with sane bounds first. NUMA_NO_NODE is considered a special valid case meaning that benchmarking kthreads won't be bound to a cpuset of a given node. Found by Linux Verification Center (linuxtesting.org). Fixes: 65789daa8087 ("dma-mapping: add benchmark support for streaming DMA APIs") Signed-off-by: Fedor Pchelkin <[email protected]> Reviewed-by: Robin Murphy <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2024-05-23dma-mapping: benchmark: avoid needless copy_to_user if benchmark failsFedor Pchelkin1-0/+3
If do_map_benchmark() has failed, there is nothing useful to copy back to userspace. Suggested-by: Barry Song <[email protected]> Signed-off-by: Fedor Pchelkin <[email protected]> Acked-by: Robin Murphy <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2024-05-23dma-mapping: benchmark: fix up kthread-related error handlingFedor Pchelkin1-6/+10
kthread creation failure is invalidly handled inside do_map_benchmark(). The put_task_struct() calls on the error path are supposed to balance the get_task_struct() calls which only happen after all the kthreads are successfully created. Rollback using kthread_stop() for already created kthreads in case of such failure. In normal situation call kthread_stop_put() to gracefully stop kthreads and put their task refcounts. This should be done for all started kthreads. Found by Linux Verification Center (linuxtesting.org). Fixes: 65789daa8087 ("dma-mapping: add benchmark support for streaming DMA APIs") Suggested-by: Robin Murphy <[email protected]> Signed-off-by: Fedor Pchelkin <[email protected]> Reviewed-by: Robin Murphy <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2024-05-23null_blk: fix null-ptr-dereference while configuring 'power' and 'submit_queues'Yu Kuai1-14/+26
Writing 'power' and 'submit_queues' concurrently will trigger kernel panic: Test script: modprobe null_blk nr_devices=0 mkdir -p /sys/kernel/config/nullb/nullb0 while true; do echo 1 > submit_queues; echo 4 > submit_queues; done & while true; do echo 1 > power; echo 0 > power; done Test result: BUG: kernel NULL pointer dereference, address: 0000000000000148 Oops: 0000 [#1] PREEMPT SMP RIP: 0010:__lock_acquire+0x41d/0x28f0 Call Trace: <TASK> lock_acquire+0x121/0x450 down_write+0x5f/0x1d0 simple_recursive_removal+0x12f/0x5c0 blk_mq_debugfs_unregister_hctxs+0x7c/0x100 blk_mq_update_nr_hw_queues+0x4a3/0x720 nullb_update_nr_hw_queues+0x71/0xf0 [null_blk] nullb_device_submit_queues_store+0x79/0xf0 [null_blk] configfs_write_iter+0x119/0x1e0 vfs_write+0x326/0x730 ksys_write+0x74/0x150 This is because del_gendisk() can concurrent with blk_mq_update_nr_hw_queues(): nullb_device_power_store nullb_apply_submit_queues null_del_dev del_gendisk nullb_update_nr_hw_queues if (!dev->nullb) // still set while gendisk is deleted return 0 blk_mq_update_nr_hw_queues dev->nullb = NULL Fix this problem by resuing the global mutex to protect nullb_device_power_store() and nullb_update_nr_hw_queues() from configfs. Fixes: 45919fbfe1c4 ("null_blk: Enable modifying 'submit_queues' after an instance has been configured") Reported-and-tested-by: Yi Zhang <[email protected]> Closes: https://lore.kernel.org/all/CAHj4cs9LgsHLnjg8z06LQ3Pr5cax-+Ps+xT7AP7TPnEjStuwZA@mail.gmail.com/ Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Zhu Yanjun <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-05-23irqchip: riscv-imsic: Fixup riscv_ipi_set_virq_range() conflictPalmer Dabbelt1-1/+1
There was a semantic conflict between 21a8f8a0eb35 ("irqchip: Add RISC-V incoming MSI controller early driver") and dc892fb44322 ("riscv: Use IPIs for remote cache/TLB flushes by default") due to an API change. This manifests as a build failure post-merge. Reported-by: Tomasz Jeznach <[email protected]> Link: https://lore.kernel.org/all/mhng-10b71228-cf3e-42ca-9abf-5464b15093f1@palmer-ri-x1c9/ Fixes: 0bfbc914d943 ("Merge tag 'riscv-for-linus-6.10-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux") Reviewed-by: Anup Patel <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Palmer Dabbelt <[email protected]>
2024-05-23spi: stm32: Revert change that enabled controller before asserting CSUwe Kleine-König1-2/+12
On stm32mp157 enabling the controller before asserting CS makes the hardware trigger spurious interrupts in a tight loop and the transfers fail. Revert the commit that swapped the order of enable and CS. This reintroduces the problem that swapping was supposed to fix, which however is less grave. Reported-by: Leonard Göhrs <[email protected]> Link: https://lore.kernel.org/all/[email protected]/ Fixes: 52b62e7a5d4f ("spi: stm32: enable controller before asserting CS") Signed-off-by: Uwe Kleine-König <[email protected]> Link: https://msgid.link/r/[email protected] Signed-off-by: Mark Brown <[email protected]>
2024-05-23spi: Check if transfer is mapped before calling DMA sync APIsAndy Shevchenko1-5/+13
The resent update to remove the orig_nents checks revealed that not all DMA sync backends can cope with the unallocated SG list, while supplying orig_nents == 0 (the commit 861370f49ce4 ("iommu/dma: force bouncing if the size is not cacheline-aligned"), for example, makes that happen for the IOMMU case). It means we have to check if the buffers are DMA mapped before trying to sync them. Re-introduce that check in a form of calling ->can_dma() in the same way as it's done in the DMA mapping loop for the SPI transfers. Reported-by: Nícolas F. R. A. Prado <[email protected]> Reported-by: Neil Armstrong <[email protected]> Closes: https://lore.kernel.org/r/[email protected] Closes: https://lore.kernel.org/all/d3679496-2e4e-4a7c-97ed-f193bd53af1d@notapiano Fixes: 8cc3bad9d9d6 ("spi: Remove unneded check for orig_nents") Suggested-by: Nícolas F. R. A. Prado <[email protected]> Tested-by: Nícolas F. R. A. Prado <[email protected]> Signed-off-by: Andy Shevchenko <[email protected]> Link: https://msgid.link/r/[email protected] Signed-off-by: Mark Brown <[email protected]>
2024-05-23spi: Don't mark message DMA mapped when no transfer in it isAndy Shevchenko1-0/+4
There is no need to set the DMA mapped flag of the message if it has no mapped transfers. Moreover, it may give the code a chance to take the wrong paths, i.e. to exercise DMA related APIs on unmapped data. Make __spi_map_msg() to bail earlier on the above mentioned cases. Fixes: 99adef310f68 ("spi: Provide core support for DMA mapping transfers") Signed-off-by: Andy Shevchenko <[email protected]> Link: https://msgid.link/r/[email protected] Signed-off-by: Mark Brown <[email protected]>
2024-05-23Merge tag 'asoc-fix-v6.10-merge-window' of ↵Takashi Iwai8-89/+51
https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus ASoC: Fixes for v6.10 A bunch of fixes that came in during the merge window, all driver specific and none of them especially remarkable.
2024-05-239p: add missing locking around taking dentry fid listDominique Martinet1-2/+7
Fix a use-after-free on dentry's d_fsdata fid list when a thread looks up a fid through dentry while another thread unlinks it: UAF thread: refcount_t: addition on 0; use-after-free. p9_fid_get linux/./include/net/9p/client.h:262 v9fs_fid_find+0x236/0x280 linux/fs/9p/fid.c:129 v9fs_fid_lookup_with_uid linux/fs/9p/fid.c:181 v9fs_fid_lookup+0xbf/0xc20 linux/fs/9p/fid.c:314 v9fs_vfs_getattr_dotl+0xf9/0x360 linux/fs/9p/vfs_inode_dotl.c:400 vfs_statx+0xdd/0x4d0 linux/fs/stat.c:248 Freed by: p9_fid_destroy (inlined) p9_client_clunk+0xb0/0xe0 linux/net/9p/client.c:1456 p9_fid_put linux/./include/net/9p/client.h:278 v9fs_dentry_release+0xb5/0x140 linux/fs/9p/vfs_dentry.c:55 v9fs_remove+0x38f/0x620 linux/fs/9p/vfs_inode.c:518 vfs_unlink+0x29a/0x810 linux/fs/namei.c:4335 The problem is that d_fsdata was not accessed under d_lock, because d_release() normally is only called once the dentry is otherwise no longer accessible but since we also call it explicitly in v9fs_remove that lock is required: move the hlist out of the dentry under lock then unref its fids once they are no longer accessible. Fixes: 154372e67d40 ("fs/9p: fix create-unlink-getattr idiom") Cc: [email protected] Reported-by: Meysam Firouzi Reported-by: Amirmohammad Eftekhar Reviewed-by: Christian Schoenebeck <[email protected]> Message-ID: <[email protected]> Signed-off-by: Dominique Martinet <[email protected]>
2024-05-23Merge branch 'intel-interpret-set_channels-input-differently'Paolo Abeni2-32/+8
Jacob Keller says: ==================== intel: Interpret .set_channels() input differently The ice and idpf drivers can trigger a crash with AF_XDP due to incorrect interpretation of the asymmetric Tx and Rx parameters in their .set_channels() implementations: 1. ethtool -l <IFNAME> -> combined: 40 2. Attach AF_XDP to queue 30 3. ethtool -L <IFNAME> rx 15 tx 15 combined number is not specified, so command becomes {rx_count = 15, tx_count = 15, combined_count = 40}. 4. ethnl_set_channels checks, if there are any AF_XDP of queues from the new (combined_count + rx_count) to the old one, so from 55 to 40, check does not trigger. 5. the driver interprets `rx 15 tx 15` as 15 combined channels and deletes the queue that AF_XDP is attached to. This is fundamentally a problem with interpreting a request for asymmetric queues as symmetric combined queues. Fix the ice and idpf drivers to stop interpreting such requests as a request for combined queues. Due to current driver design for both ice and idpf, it is not possible to support requests of the same count of Tx and Rx queues with independent interrupts, (i.e. ethtool -L <IFNAME> rx 15 tx 15) so such requests are now rejected. Signed-off-by: Jacob Keller <[email protected]> ==================== Link: https://lore.kernel.org/r/20240521-iwl-net-2024-05-14-set-channels-fixes-v2-0-7aa39e2e99f1@intel.com Signed-off-by: Paolo Abeni <[email protected]>
2024-05-23idpf: Interpret .set_channels() input differentlyLarysa Zaremba1-15/+6
Unlike ice, idpf does not check, if user has requested at least 1 combined channel. Instead, it relies on a check in the core code. Unfortunately, the check does not trigger for us because of the hacky .set_channels() interpretation logic that is not consistent with the core code. This naturally leads to user being able to trigger a crash with an invalid input. This is how: 1. ethtool -l <IFNAME> -> combined: 40 2. ethtool -L <IFNAME> rx 0 tx 0 combined number is not specified, so command becomes {rx_count = 0, tx_count = 0, combined_count = 40}. 3. ethnl_set_channels checks, if there is at least 1 RX and 1 TX channel, comparing (combined_count + rx_count) and (combined_count + tx_count) to zero. Obviously, (40 + 0) is greater than zero, so the core code deems the input OK. 4. idpf interprets `rx 0 tx 0` as 0 channels and tries to proceed with such configuration. The issue has to be solved fundamentally, as current logic is also known to cause AF_XDP problems in ice [0]. Interpret the command in a way that is more consistent with ethtool manual [1] (--show-channels and --set-channels) and new ice logic. Considering that in the idpf driver only the difference between RX and TX queues forms dedicated channels, change the correct way to set number of channels to: ethtool -L <IFNAME> combined 10 /* For symmetric queues */ ethtool -L <IFNAME> combined 8 tx 2 rx 0 /* For asymmetric queues */ [0] https://lore.kernel.org/netdev/[email protected]/ [1] https://man7.org/linux/man-pages/man8/ethtool.8.html Fixes: 02cbfba1add5 ("idpf: add ethtool callbacks") Reviewed-by: Przemek Kitszel <[email protected]> Reviewed-by: Igor Bagnucki <[email protected]> Signed-off-by: Larysa Zaremba <[email protected]> Tested-by: Krishneil Singh <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: Jacob Keller <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2024-05-23ice: Interpret .set_channels() input differentlyLarysa Zaremba1-17/+2
A bug occurs because a safety check guarding AF_XDP-related queues in ethnl_set_channels(), does not trigger. This happens, because kernel and ice driver interpret the ethtool command differently. How the bug occurs: 1. ethtool -l <IFNAME> -> combined: 40 2. Attach AF_XDP to queue 30 3. ethtool -L <IFNAME> rx 15 tx 15 combined number is not specified, so command becomes {rx_count = 15, tx_count = 15, combined_count = 40}. 4. ethnl_set_channels checks, if there are any AF_XDP of queues from the new (combined_count + rx_count) to the old one, so from 55 to 40, check does not trigger. 5. ice interprets `rx 15 tx 15` as 15 combined channels and deletes the queue that AF_XDP is attached to. Interpret the command in a way that is more consistent with ethtool manual [0] (--show-channels and --set-channels). Considering that in the ice driver only the difference between RX and TX queues forms dedicated channels, change the correct way to set number of channels to: ethtool -L <IFNAME> combined 10 /* For symmetric queues */ ethtool -L <IFNAME> combined 8 tx 2 rx 0 /* For asymmetric queues */ [0] https://man7.org/linux/man-pages/man8/ethtool.8.html Fixes: 87324e747fde ("ice: Implement ethtool ops for channels") Reviewed-by: Michal Swiatkowski <[email protected]> Signed-off-by: Larysa Zaremba <[email protected]> Tested-by: Chandan Kumar Rout <[email protected]> Tested-by: Pucha Himasekhar Reddy <[email protected]> Acked-by: Maciej Fijalkowski <[email protected]> Signed-off-by: Jacob Keller <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2024-05-23drivers/xen: Improve the late XenStore init protocolHenry Wang1-13/+23
Currently, the late XenStore init protocol is only triggered properly for the case that HVM_PARAM_STORE_PFN is ~0ULL (invalid). For the case that XenStore interface is allocated but not ready (the connection status is not XENSTORE_CONNECTED), Linux should also wait until the XenStore is set up properly. Introduce a macro to describe the XenStore interface is ready, use it in xenbus_probe_initcall() to select the code path of doing the late XenStore init protocol or not. Since now we have more than one condition for XenStore late init, rework the check in xenbus_probe() for the free_irq(). Take the opportunity to enhance the check of the allocated XenStore interface can be properly mapped, and return error early if the memremap() fails. Fixes: 5b3353949e89 ("xen: add support for initializing xenstore later as HVM domain") Signed-off-by: Henry Wang <[email protected]> Signed-off-by: Michal Orzel <[email protected]> Reviewed-by: Stefano Stabellini <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Juergen Gross <[email protected]>