aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2023-04-05perf/core: Fix the same task check in perf_event_set_outputKan Liang1-1/+1
The same task check in perf_event_set_output has some potential issues for some usages. For the current perf code, there is a problem if using of perf_event_open() to have multiple samples getting into the same mmap’d memory when they are both attached to the same process. https://lore.kernel.org/all/[email protected]/ Because the event->ctx is not ready when the perf_event_set_output() is invoked in the perf_event_open(). Besides the above issue, before the commit bd2756811766 ("perf: Rewrite core context handling"), perf record can errors out when sampling with a hardware event and a software event as below. $ perf record -e cycles,dummy --per-thread ls failed to mmap with 22 (Invalid argument) That's because that prior to the commit a hardware event and a software event are from different task context. The problem should be a long time issue since commit c3f00c70276d ("perk: Separate find_get_context() from event initialization"). The task struct is stored in the event->hw.target for each per-thread event. It is a more reliable way to determine whether two events are attached to the same task. The event->hw.target was also introduced several years ago by the commit 50f16a8bf9d7 ("perf: Remove type specific target pointers"). It can not only be used to fix the issue with the current code, but also back port to fix the issues with an older kernel. Note: The event->hw.target was introduced later than commit c3f00c70276d. The patch may cannot be applied between the commit c3f00c70276d and commit 50f16a8bf9d7. Anybody that wants to back-port this at that period may have to find other solutions. Fixes: c3f00c70276d ("perf: Separate find_get_context() from event initialization") Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Zhengjun Xing <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2023-04-05perf: Optimize perf_pmu_migrate_context()Peter Zijlstra1-5/+7
Thomas reported that offlining CPUs spends a lot of time in synchronize_rcu() as called from perf_pmu_migrate_context() even though he's not actually using uncore events. Turns out, the thing is unconditionally waiting for RCU, even if there's no actual events to migrate. Fixes: 0cda4c023132 ("perf: Introduce perf_pmu_migrate_context()") Reported-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Thomas Gleixner <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Reviewed-by: Paul E. McKenney <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2023-04-05accel/ivpu: Fix S3 system suspend when not idleJacek Lawrynowicz1-15/+11
Wait for VPU to be idle in ivpu_pm_suspend_cb() before powering off the device, so jobs are not lost and TDRs are not triggered after resume. Fixes: 852be13f3bd3 ("accel/ivpu: Add PM support") Signed-off-by: Stanislaw Gruszka <[email protected]> Reviewed-by: Jeffrey Hugo <[email protected]> Signed-off-by: Jacek Lawrynowicz <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2023-04-05accel/ivpu: Add dma fence to command buffers onlyKarol Wachowski1-11/+7
Currently job->done_fence is added to every BO handle within a job. If job handle (command buffer) is shared between multiple submits, KMD will add the fence in each of them. Then bo_wait_ioctl() executed on command buffer will exit only when all jobs containing that handle are done. This creates deadlock scenario for user mode driver in case when job handle is added as dependency of another job, because bo_wait_ioctl() of first job will wait until second job finishes, and second job can not finish before first one. Having fences added only to job buffer handle allows user space to execute bo_wait_ioctl() on the job even if it's handle is submitted with other job. Fixes: cd7272215c44 ("accel/ivpu: Add command buffer submission logic") Signed-off-by: Karol Wachowski <[email protected]> Signed-off-by: Stanislaw Gruszka <[email protected]> Reviewed-by: Jeffrey Hugo <[email protected]> Signed-off-by: Jacek Lawrynowicz <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2023-04-04tracing: Fix ftrace_boot_snapshot command line logicSteven Rostedt (Google)1-6/+7
The kernel command line ftrace_boot_snapshot by itself is supposed to trigger a snapshot at the end of boot up of the main top level trace buffer. A ftrace_boot_snapshot=foo will do the same for an instance called foo that was created by trace_instance=foo,... The logic was broken where if ftrace_boot_snapshot was by itself, it would trigger a snapshot for all instances that had tracing enabled, regardless if it asked for a snapshot or not. When a snapshot is requested for a buffer, the buffer's tr->allocated_snapshot is set to true. Use that to know if a trace buffer wants a snapshot at boot up or not. Since the top level buffer is part of the ftrace_trace_arrays list, there's no reason to treat it differently than the other buffers. Just iterate the list if ftrace_boot_snapshot was specified. Link: https://lkml.kernel.org/r/[email protected] Cc: [email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Ross Zwisler <[email protected]> Fixes: 9c1c251d670bc ("tracing: Allow boot instances to have snapshot buffers") Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-04-04tracing: Have tracing_snapshot_instance_cond() write errors to the ↵Steven Rostedt (Google)1-7/+7
appropriate instance If a trace instance has a failure with its snapshot code, the error message is to be written to that instance's buffer. But currently, the message is written to the top level buffer. Worse yet, it may also disable the top level buffer and not the instance that had the issue. Link: https://lkml.kernel.org/r/[email protected] Cc: [email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Ross Zwisler <[email protected]> Fixes: 2824f50332486 ("tracing: Make the snapshot trigger work with instances") Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-04-04gve: Secure enough bytes in the first TX desc for all TCP pktsShailend Chand2-7/+7
Non-GSO TCP packets whose SKBs' linear portion did not include the entire TCP header were not populating the first Tx descriptor with as many bytes as the vNIC expected. This change ensures that all TCP packets populate the first descriptor with the correct number of bytes. Fixes: 893ce44df565 ("gve: Add basic driver framework for Compute Engine Virtual NIC") Signed-off-by: Shailend Chand <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-04-04netlink: annotate lockless accesses to nlk->max_recvmsg_lenEric Dumazet1-6/+9
syzbot reported a data-race in data-race in netlink_recvmsg() [1] Indeed, netlink_recvmsg() can be run concurrently, and netlink_dump() also needs protection. [1] BUG: KCSAN: data-race in netlink_recvmsg / netlink_recvmsg read to 0xffff888141840b38 of 8 bytes by task 23057 on cpu 0: netlink_recvmsg+0xea/0x730 net/netlink/af_netlink.c:1988 sock_recvmsg_nosec net/socket.c:1017 [inline] sock_recvmsg net/socket.c:1038 [inline] __sys_recvfrom+0x1ee/0x2e0 net/socket.c:2194 __do_sys_recvfrom net/socket.c:2212 [inline] __se_sys_recvfrom net/socket.c:2208 [inline] __x64_sys_recvfrom+0x78/0x90 net/socket.c:2208 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd write to 0xffff888141840b38 of 8 bytes by task 23037 on cpu 1: netlink_recvmsg+0x114/0x730 net/netlink/af_netlink.c:1989 sock_recvmsg_nosec net/socket.c:1017 [inline] sock_recvmsg net/socket.c:1038 [inline] ____sys_recvmsg+0x156/0x310 net/socket.c:2720 ___sys_recvmsg net/socket.c:2762 [inline] do_recvmmsg+0x2e5/0x710 net/socket.c:2856 __sys_recvmmsg net/socket.c:2935 [inline] __do_sys_recvmmsg net/socket.c:2958 [inline] __se_sys_recvmmsg net/socket.c:2951 [inline] __x64_sys_recvmmsg+0xe2/0x160 net/socket.c:2951 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd value changed: 0x0000000000000000 -> 0x0000000000001000 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 23037 Comm: syz-executor.2 Not tainted 6.3.0-rc4-syzkaller-00195-g5a57b48fdfcb #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/02/2023 Fixes: 9063e21fb026 ("netlink: autosize skb lengthes") Reported-by: syzbot <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Simon Horman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-04-04ethtool: reset #lanes when lanes is omittedAndy Roulin1-3/+4
If the number of lanes was forced and then subsequently the user omits this parameter, the ksettings->lanes is reset. The driver should then reset the number of lanes to the device's default for the specified speed. However, although the ksettings->lanes is set to 0, the mod variable is not set to true to indicate the driver and userspace should be notified of the changes. The consequence is that the same ethtool operation will produce different results based on the initial state. If the initial state is: $ ethtool swp1 | grep -A 3 'Speed: ' Speed: 500000Mb/s Lanes: 2 Duplex: Full Auto-negotiation: on then executing 'ethtool -s swp1 speed 50000 autoneg off' will yield: $ ethtool swp1 | grep -A 3 'Speed: ' Speed: 500000Mb/s Lanes: 2 Duplex: Full Auto-negotiation: off While if the initial state is: $ ethtool swp1 | grep -A 3 'Speed: ' Speed: 500000Mb/s Lanes: 1 Duplex: Full Auto-negotiation: off executing the same 'ethtool -s swp1 speed 50000 autoneg off' results in: $ ethtool swp1 | grep -A 3 'Speed: ' Speed: 500000Mb/s Lanes: 1 Duplex: Full Auto-negotiation: off This patch fixes this behavior. Omitting lanes will always results in the driver choosing the default lane width for the chosen speed. In this scenario, regardless of the initial state, the end state will be, e.g., $ ethtool swp1 | grep -A 3 'Speed: ' Speed: 500000Mb/s Lanes: 2 Duplex: Full Auto-negotiation: off Fixes: 012ce4dd3102 ("ethtool: Extend link modes settings uAPI with lanes") Signed-off-by: Andy Roulin <[email protected]> Reviewed-by: Danielle Ratson <[email protected]> Reviewed-by: Ido Schimmel <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-04-04Merge branch 'raw-ping-fix-locking-in-proc-net-raw-icmp'Jakub Kicinski5-35/+33
Kuniyuki Iwashima says: ==================== raw/ping: Fix locking in /proc/net/{raw,icmp}. The first patch fixes a NULL deref for /proc/net/raw and second one fixes the same issue for ping sockets. The first patch also converts hlist_nulls to hlist, but this is because the current code uses sk_nulls_for_each() for lockless readers, instead of sk_nulls_for_each_rcu() which adds memory barrier, but raw sockets does not use the nulls marker nor SLAB_TYPESAFE_BY_RCU in the first place. OTOH, the ping sockets already uses sk_nulls_for_each_rcu(), and such conversion can be posted later for net-next. ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-04-04ping: Fix potentail NULL deref for /proc/net/icmp.Kuniyuki Iwashima1-4/+4
After commit dbca1596bbb0 ("ping: convert to RCU lookups, get rid of rwlock"), we use RCU for ping sockets, but we should use spinlock for /proc/net/icmp to avoid a potential NULL deref mentioned in the previous patch. Let's go back to using spinlock there. Note we can convert ping sockets to use hlist instead of hlist_nulls because we do not use SLAB_TYPESAFE_BY_RCU for ping sockets. Fixes: dbca1596bbb0 ("ping: convert to RCU lookups, get rid of rwlock") Signed-off-by: Kuniyuki Iwashima <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2023-04-04raw: Fix NULL deref in raw_get_next().Kuniyuki Iwashima4-31/+29
Dae R. Jeong reported a NULL deref in raw_get_next() [0]. It seems that the repro was running these sequences in parallel so that one thread was iterating on a socket that was being freed in another netns. unshare(0x40060200) r0 = syz_open_procfs(0x0, &(0x7f0000002080)='net/raw\x00') socket$inet_icmp_raw(0x2, 0x3, 0x1) pread64(r0, &(0x7f0000000000)=""/10, 0xa, 0x10000000007f) After commit 0daf07e52709 ("raw: convert raw sockets to RCU"), we use RCU and hlist_nulls_for_each_entry() to iterate over SOCK_RAW sockets. However, we should use spinlock for slow paths to avoid the NULL deref. Also, SOCK_RAW does not use SLAB_TYPESAFE_BY_RCU, and the slab object is not reused during iteration in the grace period. In fact, the lockless readers do not check the nulls marker with get_nulls_value(). So, SOCK_RAW should use hlist instead of hlist_nulls. Instead of adding an unnecessary barrier by sk_nulls_for_each_rcu(), let's convert hlist_nulls to hlist and use sk_for_each_rcu() for fast paths and sk_for_each() and spinlock for /proc/net/raw. [0]: general protection fault, probably for non-canonical address 0xdffffc0000000005: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f] CPU: 2 PID: 20952 Comm: syz-executor.0 Not tainted 6.2.0-g048ec869bafd-dirty #7 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 RIP: 0010:read_pnet include/net/net_namespace.h:383 [inline] RIP: 0010:sock_net include/net/sock.h:649 [inline] RIP: 0010:raw_get_next net/ipv4/raw.c:974 [inline] RIP: 0010:raw_get_idx net/ipv4/raw.c:986 [inline] RIP: 0010:raw_seq_start+0x431/0x800 net/ipv4/raw.c:995 Code: ef e8 33 3d 94 f7 49 8b 6d 00 4c 89 ef e8 b7 65 5f f7 49 89 ed 49 83 c5 98 0f 84 9a 00 00 00 48 83 c5 c8 48 89 e8 48 c1 e8 03 <42> 80 3c 30 00 74 08 48 89 ef e8 00 3d 94 f7 4c 8b 7d 00 48 89 ef RSP: 0018:ffffc9001154f9b0 EFLAGS: 00010206 RAX: 0000000000000005 RBX: 1ffff1100302c8fd RCX: 0000000000000000 RDX: 0000000000000028 RSI: ffffc9001154f988 RDI: ffffc9000f77a338 RBP: 0000000000000029 R08: ffffffff8a50ffb4 R09: fffffbfff24b6bd9 R10: fffffbfff24b6bd9 R11: 0000000000000000 R12: ffff88801db73b78 R13: fffffffffffffff9 R14: dffffc0000000000 R15: 0000000000000030 FS: 00007f843ae8e700(0000) GS:ffff888063700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055bb9614b35f CR3: 000000003c672000 CR4: 00000000003506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> seq_read_iter+0x4c6/0x10f0 fs/seq_file.c:225 seq_read+0x224/0x320 fs/seq_file.c:162 pde_read fs/proc/inode.c:316 [inline] proc_reg_read+0x23f/0x330 fs/proc/inode.c:328 vfs_read+0x31e/0xd30 fs/read_write.c:468 ksys_pread64 fs/read_write.c:665 [inline] __do_sys_pread64 fs/read_write.c:675 [inline] __se_sys_pread64 fs/read_write.c:672 [inline] __x64_sys_pread64+0x1e9/0x280 fs/read_write.c:672 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x4e/0xa0 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x478d29 Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f843ae8dbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000011 RAX: ffffffffffffffda RBX: 0000000000791408 RCX: 0000000000478d29 RDX: 000000000000000a RSI: 0000000020000000 RDI: 0000000000000003 RBP: 00000000f477909a R08: 0000000000000000 R09: 0000000000000000 R10: 000010000000007f R11: 0000000000000246 R12: 0000000000791740 R13: 0000000000791414 R14: 0000000000791408 R15: 00007ffc2eb48a50 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- RIP: 0010:read_pnet include/net/net_namespace.h:383 [inline] RIP: 0010:sock_net include/net/sock.h:649 [inline] RIP: 0010:raw_get_next net/ipv4/raw.c:974 [inline] RIP: 0010:raw_get_idx net/ipv4/raw.c:986 [inline] RIP: 0010:raw_seq_start+0x431/0x800 net/ipv4/raw.c:995 Code: ef e8 33 3d 94 f7 49 8b 6d 00 4c 89 ef e8 b7 65 5f f7 49 89 ed 49 83 c5 98 0f 84 9a 00 00 00 48 83 c5 c8 48 89 e8 48 c1 e8 03 <42> 80 3c 30 00 74 08 48 89 ef e8 00 3d 94 f7 4c 8b 7d 00 48 89 ef RSP: 0018:ffffc9001154f9b0 EFLAGS: 00010206 RAX: 0000000000000005 RBX: 1ffff1100302c8fd RCX: 0000000000000000 RDX: 0000000000000028 RSI: ffffc9001154f988 RDI: ffffc9000f77a338 RBP: 0000000000000029 R08: ffffffff8a50ffb4 R09: fffffbfff24b6bd9 R10: fffffbfff24b6bd9 R11: 0000000000000000 R12: ffff88801db73b78 R13: fffffffffffffff9 R14: dffffc0000000000 R15: 0000000000000030 FS: 00007f843ae8e700(0000) GS:ffff888063700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f92ff166000 CR3: 000000003c672000 CR4: 00000000003506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Fixes: 0daf07e52709 ("raw: convert raw sockets to RCU") Reported-by: syzbot <[email protected]> Reported-by: Dae R. Jeong <[email protected]> Link: https://lore.kernel.org/netdev/ZCA2mGV_cmq7lIfV@dragonet/ Signed-off-by: Kuniyuki Iwashima <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2023-04-04Merge branch 'for-6.3/cxl-doe-fixes' into for-6.3/cxlDan Williams483-4267/+4547
Pick up the fixes (first 6 patches) from the DOE rework series from Lukas for v6.3-rc. Link: https://lore.kernel.org/all/[email protected]/
2023-04-04cxl/hdm: Extend DVSEC range register emulation for region enumerationDan Williams1-5/+22
One motivation for mapping range registers to decoder objects is to use those settings for region autodiscovery. The need to map a region for devices programmed to use range registers is especially urgent now that the kernel no longer routes "Soft Reserved" ranges in the memory map to device-dax by default. The CXL memory range loses all access mechanisms. Complete the implementation by marking the DPA reservation and setting the endpoint-decoder state to signal autodiscovery. Note that the default settings of ways=1 and granularity=4096 set in cxl_decode_init() do not need to be updated. Fixes: 09d09e04d2fc ("cxl/dax: Create dax devices for CXL RAM regions") Tested-by: Dave Jiang <[email protected]> Tested-by: Gregory Price <[email protected]> Link: https://lore.kernel.org/r/168012575521.221280.14177293493678527326.stgit@dwillia2-xfh.jf.intel.com Reviewed-by: Dave Jiang <[email protected]> Signed-off-by: Dan Williams <[email protected]>
2023-04-04cxl/hdm: Limit emulation to the number of range registersDan Williams1-36/+46
Recall that range register emulation seeks to treat the 2 potential range registers as Linux CXL "decoder" objects. The number of range registers can be 1 or 2, while HDM decoder ranges can include more than 2. Be careful not to confuse DVSEC range count with HDM capability decoder count. Commit to range register earlier in devm_cxl_setup_hdm(). Otherwise, a device with more HDM decoders than range registers can set @cxlhdm->decoder_count to an invalid value. Avoid introducing a forward declaration by just moving the definition of should_emulate_decoders() earlier in the file. should_emulate_decoders() is unchanged. Tested-by: Dave Jiang <[email protected]> Fixes: d7a2153762c7 ("cxl/hdm: Add emulation when HDM decoders are not committed") Reviewed-by: Jonathan Cameron <[email protected]> Reviewed-by: Dave Jiang <[email protected]> Link: https://lore.kernel.org/r/168012574932.221280.15944705098679646436.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Dan Williams <[email protected]>
2023-04-04cxl/region: Move coherence tracking into cxl_region_attach()Dan Williams1-2/+1
Each time the contents of a given HPA are potentially changed in a cache incoherent manner the CXL core sets CXL_REGION_F_INCOHERENT to invalidate CPU caches before the region is used. Successful invocation of attach_target() indicates that DPA has been newly assigned to a given HPA in the dynamic region creation flow. However, attach_target() is also reused in the autodiscovery flow where the region was activated by platform firmware. In that case there is no need to invalidate caches because that region is already in active use and nothing about the autodiscovery flow modifies the HPA-to-DPA relationship. In the autodiscovery case cxl_region_attach() exits early after determining the endpoint decoder is already correctly attached to the region. Fixes: a32320b71f08 ("cxl/region: Add region autodiscovery") Reviewed-by: Fan Ni <[email protected]> Reviewed-by: Dave Jiang <[email protected]> Link: https://lore.kernel.org/r/168002858817.50647.1217607907088920888.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Dan Williams <[email protected]>
2023-04-04cxl/region: Fix region setup/teardown for RCDsDan Williams1-1/+27
RCDs (CXL memory devices that link train without VH capability and show up as root complex integrated endpoints), hide the presence of the link between the endpoint and the host-bridge. The CXL region setup/teardown paths assume that a link hop is present and go looking for at least one 'struct cxl_port' instance between the CXL root port-object and an endpoint port-object leading to crashes of the form: BUG: kernel NULL pointer dereference, address: 0000000000000008 [..] RIP: 0010:cxl_region_setup_targets+0x3e9/0xae0 [cxl_core] [..] Call Trace: <TASK> cxl_region_attach+0x46c/0x7a0 [cxl_core] cxl_create_region+0x20b/0x270 [cxl_core] cxl_mock_mem_probe+0x641/0x800 [cxl_mock_mem] platform_probe+0x5b/0xb0 Detect RCDs explicitly and skip walking the non-existent port hierarchy between root and endpoint in that case. While this has been a problem since: commit 0a19bfc8de93 ("cxl/port: Add RCD endpoint port enumeration") ...it becomes a more reliable crash scenario with the new autodiscovery implementation. Fixes: a32320b71f08 ("cxl/region: Add region autodiscovery") Reviewed-by: Ira Weiny <[email protected]> Reviewed-by: Dave Jiang <[email protected]> Link: https://lore.kernel.org/r/168002858268.50647.728091521032131326.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Dan Williams <[email protected]>
2023-04-04cxl/port: Fix find_cxl_root() for RCDs and simplify itDan Williams5-38/+14
The find_cxl_root() helper is used to lookup root decoders and other CXL platform topology information for a given endpoint. It turns out that for RCDs it has never worked. The result of find_cxl_root(&cxlmd->dev) is always NULL for the RCH topology case because it expects to find a cxl_port at the host-bridge. RCH topologies only have the root cxl_port object with the host-bridge as a dport. While there are no reports of this being a problem to date, by inspection region enumeration should crash as a result of this problem, and it does in a local unit test for this scenario. However, an observation that ever since: commit f17b558d6663 ("cxl/pmem: Refactor nvdimm device registration, delete the workqueue") ...all callers of find_cxl_root() occur after the memdev connection to the port topology has been established. That means that find_cxl_root() can be simplified to a walk of the endpoint port topology to the root. Switch to that arrangement which also fixes the RCD bug. Fixes: a32320b71f08 ("cxl/region: Add region autodiscovery") Reviewed-by: Jonathan Cameron <[email protected]> Reviewed-by: Dave Jiang <[email protected]> Link: https://lore.kernel.org/r/168002857715.50647.344876437247313909.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Dan Williams <[email protected]>
2023-04-04cxl/hdm: Skip emulation when driver manages mem_enableDan Williams3-15/+22
If the driver is allowed to enable memory operation itself then it can also turn on HDM decoder support at will. With this the second call to cxl_setup_hdm_decoder_from_dvsec(), when an HDM decoder is not committed, is not needed. Fixes: b777e9bec960 ("cxl/hdm: Emulate HDM decoder from DVSEC range registers") Link: http://lore.kernel.org/r/[email protected] Reported-by: Jonathan Cameron <[email protected]> Tested-by: Jonathan Cameron <[email protected]> Reviewed-by: Jonathan Cameron <[email protected]> Reviewed-by: Fan Ni <[email protected]> Reviewed-by: Dave Jiang <[email protected]> Link: https://lore.kernel.org/r/167703068474.185722.664126485486344246.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Dan Williams <[email protected]>
2023-04-04cxl/hdm: Fix double allocation of @cxlhdmDan Williams1-28/+6
devm_cxl_setup_emulated_hdm() reallocates an instance of @cxlhdm that was already allocated at the start of devm_cxl_setup_hdm(). Only one is needed and devm_cxl_setup_emulated_hdm() does not do enough to warrant being an explicit helper. Fixes: 4474ce565ee4 ("cxl/hdm: Create emulated cxl_hdm for devices that do not have HDM decoders") Tested-by: Dave Jiang <[email protected]> Reviewed-by: Dave Jiang <[email protected]> Reviewed-by: Jonathan Cameron <[email protected]> Link: https://lore.kernel.org/r/167703067936.185722.7908921750127154779.stgit@dwillia2-xfh.jf.intel.com Link: https://lore.kernel.org/r/168012574357.221280.5001364964799725366.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Dan Williams <[email protected]>
2023-04-04blk-mq: directly poll requestsKeith Busch1-3/+1
Polling needs a bio with a valid bi_bdev, but neither of those are guaranteed for polled driver requests. Make request based polling directly use blk-mq's polling function instead. When executing a request from a polled hctx, we know the request's cookie, and that it's from a live blk-mq queue that supports polling, so we can safely skip everything that bio_poll provides. Cc: [email protected] Reported-by: Martin Belanger <[email protected]> Reported-by: Daniel Wagner <[email protected]> Signed-off-by: Keith Busch <[email protected]> Tested-by: Daniel Wagner <[email protected]> Revieded-by: Daniel Wagner <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Tested-by: Shin'ichiro Kawasaki <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-04-04Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds15-29/+192
Pull kvm fixes from Paolo Bonzini: "PPC: - Hide KVM_CAP_IRQFD_RESAMPLE if XIVE is enabled s390: - Fix handling of external interrupts in protected guests x86: - Resample the pending state of IOAPIC interrupts when unmasking them - Fix usage of Hyper-V "enlightened TLB" on AMD - Small fixes to real mode exceptions - Suppress pending MMIO write exits if emulator detects exception Documentation: - Fix rST syntax" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: docs: kvm: x86: Fix broken field list KVM: PPC: Make KVM_CAP_IRQFD_RESAMPLE platform dependent KVM: s390: pv: fix external interruption loop not always detected KVM: nVMX: Do not report error code when synthesizing VM-Exit from Real Mode KVM: x86: Clear "has_error_code", not "error_code", for RM exception injection KVM: x86: Suppress pending MMIO write exits if emulator detects exception KVM: x86/ioapic: Resample the pending state of an IRQ when unmasking KVM: irqfd: Make resampler_list an RCU list KVM: SVM: Flush Hyper-V TLB when required
2023-04-04Merge tag 'nfsd-6.3-5' of ↵Linus Torvalds5-14/+24
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - Fix a crash and a resource leak in NFSv4 COMPOUND processing - Fix issues with AUTH_SYS credential handling - Try again to address an NFS/NFSD/SUNRPC build dependency regression * tag 'nfsd-6.3-5' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: NFSD: callback request does not use correct credential for AUTH_SYS NFS: Remove "select RPCSEC_GSS_KRB5 sunrpc: only free unix grouplist after RCU settles nfsd: call op_release, even when op_func returns an error NFSD: Avoid calling OPDESC() with ops->opnum == OP_ILLEGAL
2023-04-04docs: kvm: x86: Fix broken field listTakahiro Itazuri1-2/+2
Add a missing ":" to fix a broken field list. Signed-off-by: Takahiro Itazuri <[email protected]> Fixes: ba7bb663f554 ("KVM: x86: Provide per VM capability for disabling PMU virtualization") Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2023-04-04asm-generic: avoid __generic_cmpxchg_local warningsArnd Bergmann3-11/+11
Code that passes a 32-bit constant into cmpxchg() produces a harmless sparse warning because of the truncation in the branch that is not taken: fs/erofs/zdata.c: note: in included file (through /home/arnd/arm-soc/arch/arm/include/asm/cmpxchg.h, /home/arnd/arm-soc/arch/arm/include/asm/atomic.h, /home/arnd/arm-soc/include/linux/atomic.h, ...): include/asm-generic/cmpxchg-local.h:29:33: warning: cast truncates bits from constant value (5f0ecafe becomes fe) include/asm-generic/cmpxchg-local.h:33:34: warning: cast truncates bits from constant value (5f0ecafe becomes cafe) include/asm-generic/cmpxchg-local.h:29:33: warning: cast truncates bits from constant value (5f0ecafe becomes fe) include/asm-generic/cmpxchg-local.h:30:42: warning: cast truncates bits from constant value (5f0edead becomes ad) include/asm-generic/cmpxchg-local.h:33:34: warning: cast truncates bits from constant value (5f0ecafe becomes cafe) include/asm-generic/cmpxchg-local.h:34:44: warning: cast truncates bits from constant value (5f0edead becomes dead) This was reported as a regression to Matt's recent __generic_cmpxchg_local patch, though this patch only added more warnings on top of the ones that were already there. Rewording the truncation to use an explicit bitmask instead of a cast to a smaller type avoids the warning but otherwise leaves the code unchanged. I had another look at why the cast is even needed for atomic_cmpxchg(), and as Matt describes the problem here is that atomic_t contains a signed 'int', but cmpxchg() takes an 'unsigned long' argument, and converting between the two leads to a 64-bit sign-extension of negative 32-bit atomics. I checked the other implementations of arch_cmpxchg() and did not find any others that run into the same problem as __generic_cmpxchg_local(), but it's easy to be on the safe side here and always convert the signed int into an unsigned int when calling arch_cmpxchg(), as this will work even when any of the arch_cmpxchg() implementations run into the same problem. Fixes: 624654152284 ("locking/atomic: cmpxchg: Make __generic_cmpxchg_local compare against zero-extended 'old' value") Reviewed-by: Matt Evans <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]>
2023-04-04asm-generic/io.h: suppress endianness warnings for relaxed accessorsVladimir Oltean1-6/+6
Copy the forced type casts from the normal MMIO accessors to suppress the sparse warnings that point out __raw_readl() returns a native endian word (just like readl()). Signed-off-by: Vladimir Oltean <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]>
2023-04-04asm-generic/io.h: suppress endianness warnings for readq() and writeq()Vladimir Oltean1-2/+2
Commit c1d55d50139b ("asm-generic/io.h: Fix sparse warnings on big-endian architectures") missed fixing the 64-bit accessors. Arnd explains in the attached link why the casts are necessary, even if __raw_readq() and __raw_writeq() do not take endian-specific types. Link: https://lore.kernel.org/lkml/[email protected]/ Suggested-by: Arnd Bergmann <[email protected]> Signed-off-by: Vladimir Oltean <[email protected]> Reviewed-by: Jonathan Cameron <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]>
2023-04-04ice: Reset FDIR counter in FDIR init stageLingyu Liu1-0/+16
Reset the FDIR counters when FDIR inits. Without this patch, when VF initializes or resets, all the FDIR counters are not cleaned, which may cause unexpected behaviors for future FDIR rule create (e.g., rule conflict). Fixes: 1f7ea1cd6a37 ("ice: Enable FDIR Configure for AVF") Signed-off-by: Junfeng Guo <[email protected]> Signed-off-by: Lingyu Liu <[email protected]> Tested-by: Rafal Romanowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
2023-04-04ice: fix wrong fallback logic for FDIRSimei Su1-3/+4
When adding a FDIR filter, if ice_vc_fdir_set_irq_ctx returns failure, the inserted fdir entry will not be removed and if ice_vc_fdir_write_fltr returns failure, the fdir context info for irq handler will not be cleared which may lead to inconsistent or memory leak issue. This patch refines failure cases to resolve this issue. Fixes: 1f7ea1cd6a37 ("ice: Enable FDIR Configure for AVF") Signed-off-by: Simei Su <[email protected]> Tested-by: Rafal Romanowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
2023-04-04NFSD: callback request does not use correct credential for AUTH_SYSDai Ngo1-2/+2
Currently callback request does not use the credential specified in CREATE_SESSION if the security flavor for the back channel is AUTH_SYS. Problem was discovered by pynfs 4.1 DELEG5 and DELEG7 test with error: DELEG5 st_delegation.testCBSecParms : FAILURE expected callback with uid, gid == 17, 19, got 0, 0 Signed-off-by: Dai Ngo <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Fixes: 8276c902bbe9 ("SUNRPC: remove uid and gid from struct auth_cred") Signed-off-by: Chuck Lever <[email protected]>
2023-04-04NFS: Remove "select RPCSEC_GSS_KRB5Chuck Lever1-1/+0
If CONFIG_CRYPTO=n (e.g. arm/shmobile_defconfig): WARNING: unmet direct dependencies detected for RPCSEC_GSS_KRB5 Depends on [n]: NETWORK_FILESYSTEMS [=y] && SUNRPC [=y] && CRYPTO [=n] Selected by [y]: - NFS_V4 [=y] && NETWORK_FILESYSTEMS [=y] && NFS_FS [=y] As NFSv4 can work without crypto enabled, remove the RPCSEC_GSS_KRB5 dependency altogether. Trond says: > It is possible to use the NFSv4.1 client with just AUTH_SYS, and > in fact there are plenty of people out there using only that. The > fact that RFC5661 gets its knickers in a twist about RPCSEC_GSS > support is largely irrelevant to those people. > > The other issue is that ’select’ enforces the strict dependency > that if the NFS client is compiled into the kernel, then the > RPCSEC_GSS and kerberos code needs to be compiled in as well: they > cannot exist as modules. Fixes: e57d06527738 ("NFS & NFSD: Update GSS dependencies") Reported-by: kernel test robot <[email protected]> Reported-by: Niklas Söderlund <[email protected]> Suggested-by: Trond Myklebust <[email protected]> Signed-off-by: Chuck Lever <[email protected]>
2023-04-04sunrpc: only free unix grouplist after RCU settlesJeff Layton1-4/+13
While the unix_gid object is rcu-freed, the group_info list that it contains is not. Ensure that we only put the group list reference once we are really freeing the unix_gid object. Reported-by: Zhi Li <[email protected]> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2183056 Signed-off-by: Jeff Layton <[email protected]> Fixes: fd5d2f78261b ("SUNRPC: Make server side AUTH_UNIX use lockless lookups") Signed-off-by: Chuck Lever <[email protected]>
2023-04-04net: stmmac: fix up RX flow hash indirection table when setting channelsCorinna Vinschen1-1/+5
stmmac_reinit_queues() fails to fix up the RX hash. Even if the number of channels gets restricted, the output of `ethtool -x' indicates that all RX queues are used: $ ethtool -l enp0s29f2 Channel parameters for enp0s29f2: Pre-set maximums: RX: 8 TX: 8 Other: n/a Combined: n/a Current hardware settings: RX: 8 TX: 8 Other: n/a Combined: n/a $ ethtool -x enp0s29f2 RX flow hash indirection table for enp0s29f2 with 8 RX ring(s): 0: 0 1 2 3 4 5 6 7 8: 0 1 2 3 4 5 6 7 [...] $ ethtool -L enp0s29f2 rx 3 $ ethtool -x enp0s29f2 RX flow hash indirection table for enp0s29f2 with 3 RX ring(s): 0: 0 1 2 3 4 5 6 7 8: 0 1 2 3 4 5 6 7 [...] Fix this by setting the indirection table according to the number of specified queues. The result is now as expected: $ ethtool -L enp0s29f2 rx 3 $ ethtool -x enp0s29f2 RX flow hash indirection table for enp0s29f2 with 3 RX ring(s): 0: 0 1 2 0 1 2 0 1 8: 2 0 1 2 0 1 2 0 [...] Tested on Intel Elkhart Lake. Fixes: 0366f7e06a6b ("net: stmmac: add ethtool support for get/set channels") Signed-off-by: Corinna Vinschen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
2023-04-04iommufd: Do not corrupt the pfn list when doing batch carryJason Gunthorpe1-1/+1
If batch->end is 0 then setting npfns[0] before computing the new value of pfns will fail to adjust the pfn and result in various page accounting corruptions. It should be ordered after. This seems to result in various kinds of page meta-data corruption related failures: WARNING: CPU: 1 PID: 527 at mm/gup.c:75 try_grab_folio+0x503/0x740 Modules linked in: CPU: 1 PID: 527 Comm: repro Not tainted 6.3.0-rc2-eeac8ede1755+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:try_grab_folio+0x503/0x740 Code: e3 01 48 89 de e8 6d c1 dd ff 48 85 db 0f 84 7c fe ff ff e8 4f bf dd ff 49 8d 47 ff 48 89 45 d0 e9 73 fe ff ff e8 3d bf dd ff <0f> 0b 31 db e9 d0 fc ff ff e8 2f bf dd ff 48 8b 5d c8 31 ff 48 89 RSP: 0018:ffffc90000f37908 EFLAGS: 00010046 RAX: 0000000000000000 RBX: 00000000fffffc02 RCX: ffffffff81504c26 RDX: 0000000000000000 RSI: ffff88800d030000 RDI: 0000000000000002 RBP: ffffc90000f37948 R08: 000000000003ca24 R09: 0000000000000008 R10: 000000000003ca00 R11: 0000000000000023 R12: ffffea000035d540 R13: 0000000000000001 R14: 0000000000000000 R15: ffffea000035d540 FS: 00007fecbf659740(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000200011c3 CR3: 000000000ef66006 CR4: 0000000000770ee0 PKRU: 55555554 Call Trace: <TASK> internal_get_user_pages_fast+0xd32/0x2200 pin_user_pages_fast+0x65/0x90 pfn_reader_user_pin+0x376/0x390 pfn_reader_next+0x14a/0x7b0 pfn_reader_first+0x140/0x1b0 iopt_area_fill_domain+0x74/0x210 iopt_table_add_domain+0x30e/0x6e0 iommufd_device_selftest_attach+0x7f/0x140 iommufd_test+0x10ff/0x16f0 iommufd_fops_ioctl+0x206/0x330 __x64_sys_ioctl+0x10e/0x160 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc Cc: <[email protected]> Fixes: f394576eb11d ("iommufd: PFN handling for iopt_pages") Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Kevin Tian <[email protected]> Reported-by: Pengfei Xu <[email protected]> Tested-by: Pengfei Xu <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2023-04-04iommufd: Fix unpinning of pages when an access is presentJason Gunthorpe1-1/+9
syzkaller found that the calculation of batch_last_index should use 'start_index' since at input to this function the batch is either empty or it has already been adjusted to cross any accesses so it will start at the point we are unmapping from. Getting this wrong causes the unmap to run over the end of the pages which corrupts pages that were never mapped. In most cases this triggers the num pinned debugging: WARNING: CPU: 0 PID: 557 at drivers/iommu/iommufd/pages.c:294 __iopt_area_unfill_domain+0x152/0x560 Modules linked in: CPU: 0 PID: 557 Comm: repro Not tainted 6.3.0-rc2-eeac8ede1755 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:__iopt_area_unfill_domain+0x152/0x560 Code: d2 0f ff 44 8b 64 24 54 48 8b 44 24 48 31 ff 44 89 e6 48 89 44 24 38 e8 fc d3 0f ff 45 85 e4 0f 85 eb 01 00 00 e8 0e d2 0f ff <0f> 0b e8 07 d2 0f ff 48 8b 44 24 38 89 5c 24 58 89 18 8b 44 24 54 RSP: 0018:ffffc9000108baf0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 00000000ffffffff RCX: ffffffff821e3f85 RDX: 0000000000000000 RSI: ffff88800faf0000 RDI: 0000000000000002 RBP: ffffc9000108bd18 R08: 000000000003ca25 R09: 0000000000000014 R10: 000000000003ca00 R11: 0000000000000024 R12: 0000000000000004 R13: 0000000000000801 R14: 00000000000007ff R15: 0000000000000800 FS: 00007f3499ce1740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000243 CR3: 00000000179c2001 CR4: 0000000000770ef0 PKRU: 55555554 Call Trace: <TASK> iopt_area_unfill_domain+0x32/0x40 iopt_table_remove_domain+0x23f/0x4c0 iommufd_device_selftest_detach+0x3a/0x90 iommufd_selftest_destroy+0x55/0x70 iommufd_object_destroy_user+0xce/0x130 iommufd_destroy+0xa2/0xc0 iommufd_fops_ioctl+0x206/0x330 __x64_sys_ioctl+0x10e/0x160 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc Also add some useful WARN_ON sanity checks. Cc: <[email protected]> Fixes: 8d160cd4d506 ("iommufd: Algorithms for PFN storage") Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Kevin Tian <[email protected]> Reported-by: Pengfei Xu <[email protected]> Tested-by: Pengfei Xu <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2023-04-04iommufd: Check for uptr overflowJason Gunthorpe1-0/+4
syzkaller found that setting up a map with a user VA that wraps past zero can trigger WARN_ONs, particularly from pin_user_pages weirdly returning 0 due to invalid arguments. Prevent creating a pages with a uptr and size that would math overflow. WARNING: CPU: 0 PID: 518 at drivers/iommu/iommufd/pages.c:793 pfn_reader_user_pin+0x2e6/0x390 Modules linked in: CPU: 0 PID: 518 Comm: repro Not tainted 6.3.0-rc2-eeac8ede1755+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:pfn_reader_user_pin+0x2e6/0x390 Code: b1 11 e9 25 fe ff ff e8 28 e4 0f ff 31 ff 48 89 de e8 2e e6 0f ff 48 85 db 74 0a e8 14 e4 0f ff e9 4d ff ff ff e8 0a e4 0f ff <0f> 0b bb f2 ff ff ff e9 3c ff ff ff e8 f9 e3 0f ff ba 01 00 00 00 RSP: 0018:ffffc90000f9fa30 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff821e2b72 RDX: 0000000000000000 RSI: ffff888014184680 RDI: 0000000000000002 RBP: ffffc90000f9fa78 R08: 00000000000000ff R09: 0000000079de6f4e R10: ffffc90000f9f790 R11: ffff888014185418 R12: ffffc90000f9fc60 R13: 0000000000000002 R14: ffff888007879800 R15: 0000000000000000 FS: 00007f4227555740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000043 CR3: 000000000e748005 CR4: 0000000000770ef0 PKRU: 55555554 Call Trace: <TASK> pfn_reader_next+0x14a/0x7b0 ? interval_tree_double_span_iter_update+0x11a/0x140 pfn_reader_first+0x140/0x1b0 iopt_pages_rw_slow+0x71/0x280 ? __this_cpu_preempt_check+0x20/0x30 iopt_pages_rw_access+0x2b2/0x5b0 iommufd_access_rw+0x19f/0x2f0 iommufd_test+0xd11/0x16f0 ? write_comp_data+0x2f/0x90 iommufd_fops_ioctl+0x206/0x330 __x64_sys_ioctl+0x10e/0x160 ? __pfx_iommufd_fops_ioctl+0x10/0x10 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc Cc: <[email protected]> Fixes: 8d160cd4d506 ("iommufd: Algorithms for PFN storage") Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Kevin Tian <[email protected]> Reported-by: Pengfei Xu <[email protected]> Tested-by: Pengfei Xu <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2023-04-04net: ethernet: ti: am65-cpsw: Fix mdio cleanup in probeSiddharth Vadapalli1-2/+4
In the am65_cpsw_nuss_probe() function's cleanup path, the call to of_platform_device_destroy() for the common->mdio_dev device is invoked unconditionally. It is possible that either the MDIO node is not present in the device-tree, or the MDIO node is disabled in the device-tree. In both these cases, the MDIO device is not created, resulting in a NULL pointer dereference when the of_platform_device_destroy() function is invoked on the common->mdio_dev device on the cleanup path. Fix this by ensuring that the common->mdio_dev device exists, before attempting to invoke of_platform_device_destroy(). Fixes: a45cfcc69a25 ("net: ethernet: ti: am65-cpsw-nuss: use of_platform_device_create() for mdio") Signed-off-by: Siddharth Vadapalli <[email protected]> Reviewed-by: Roger Quadros <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
2023-04-03PCI/DOE: Fix memory leak with CONFIG_DEBUG_OBJECTS=yLukas Wunner1-0/+1
After a pci_doe_task completes, its work_struct needs to be destroyed to avoid a memory leak with CONFIG_DEBUG_OBJECTS=y. Fixes: 9d24322e887b ("PCI/DOE: Add DOE mailbox support functions") Tested-by: Ira Weiny <[email protected]> Signed-off-by: Lukas Wunner <[email protected]> Reviewed-by: Ira Weiny <[email protected]> Reviewed-by: Davidlohr Bueso <[email protected]> Reviewed-by: Dan Williams <[email protected]> Reviewed-by: Jonathan Cameron <[email protected]> Cc: [email protected] # v6.0+ Acked-by: Bjorn Helgaas <[email protected]> Link: https://lore.kernel.org/r/775768b4912531c3b887d405fc51a50e465e1bf9.1678543498.git.lukas@wunner.de Signed-off-by: Dan Williams <[email protected]>
2023-04-03PCI/DOE: Silence WARN splat with CONFIG_DEBUG_OBJECTS=yLukas Wunner1-1/+3
Gregory Price reports a WARN splat with CONFIG_DEBUG_OBJECTS=y upon CXL probing because pci_doe_submit_task() invokes INIT_WORK() instead of INIT_WORK_ONSTACK() for a work_struct that was allocated on the stack. All callers of pci_doe_submit_task() allocate the work_struct on the stack, so replace INIT_WORK() with INIT_WORK_ONSTACK() as a backportable short-term fix. The long-term fix implemented by a subsequent commit is to move to a synchronous API which allocates the work_struct internally in the DOE library. Stacktrace for posterity: WARNING: CPU: 0 PID: 23 at lib/debugobjects.c:545 __debug_object_init.cold+0x18/0x183 CPU: 0 PID: 23 Comm: kworker/u2:1 Not tainted 6.1.0-0.rc1.20221019gitaae703b02f92.17.fc38.x86_64 #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Call Trace: pci_doe_submit_task+0x5d/0xd0 pci_doe_discovery+0xb4/0x100 pcim_doe_create_mb+0x219/0x290 cxl_pci_probe+0x192/0x430 local_pci_probe+0x41/0x80 pci_device_probe+0xb3/0x220 really_probe+0xde/0x380 __driver_probe_device+0x78/0x170 driver_probe_device+0x1f/0x90 __driver_attach_async_helper+0x5c/0xe0 async_run_entry_fn+0x30/0x130 process_one_work+0x294/0x5b0 Fixes: 9d24322e887b ("PCI/DOE: Add DOE mailbox support functions") Link: https://lore.kernel.org/linux-cxl/[email protected]/ Reported-by: Gregory Price <[email protected]> Tested-by: Ira Weiny <[email protected]> Tested-by: Gregory Price <[email protected]> Signed-off-by: Lukas Wunner <[email protected]> Reviewed-by: Ira Weiny <[email protected]> Reviewed-by: Dan Williams <[email protected]> Reviewed-by: Gregory Price <[email protected]> Cc: [email protected] # v6.0+ Reviewed-by: Jonathan Cameron <[email protected]> Acked-by: Bjorn Helgaas <[email protected]> Link: https://lore.kernel.org/r/67a9117f463ecdb38a2dbca6a20391ce2f1e7a06.1678543498.git.lukas@wunner.de Signed-off-by: Dan Williams <[email protected]>
2023-04-03cxl/pci: Handle excessive CDAT lengthLukas Wunner1-0/+3
If the length in the CDAT header is larger than the concatenation of the header and all table entries, then the CDAT exposed to user space contains trailing null bytes. Not every consumer may be able to handle that. Per Postel's robustness principle, "be liberal in what you accept" and silently reduce the cached length to avoid exposing those null bytes. Fixes: c97006046c79 ("cxl/port: Read CDAT table") Tested-by: Ira Weiny <[email protected]> Signed-off-by: Lukas Wunner <[email protected]> Reviewed-by: Dan Williams <[email protected]> Reviewed-by: Jonathan Cameron <[email protected]> Cc: [email protected] # v6.0+ Link: https://lore.kernel.org/r/6d98b3c7da5343172bd3ccabfabbc1f31c079d74.1678543498.git.lukas@wunner.de Signed-off-by: Dan Williams <[email protected]>
2023-04-03cxl/pci: Handle truncated CDAT entriesLukas Wunner2-4/+23
If truncated CDAT entries are received from a device, the concatenation of those entries constitutes a corrupt CDAT, yet is happily exposed to user space. Avoid by verifying response lengths and erroring out if truncation is detected. The last CDAT entry may still be truncated despite the checks introduced herein if the length in the CDAT header is too small. However, that is easily detectable by user space because it reaches EOF prematurely. A subsequent commit which rightsizes the CDAT response allocation closes that remaining loophole. The two lines introduced here which exceed 80 chars are shortened to less than 80 chars by a subsequent commit which migrates to a synchronous DOE API and replaces "t.task.rv" by "rc". The existing acpi_cdat_header and acpi_table_cdat struct definitions provided by ACPICA cannot be used because they do not employ __le16 or __le32 types. I believe that cannot be changed because those types are Linux-specific and ACPI is specified for little endian platforms only, hence doesn't care about endianness. So duplicate the structs. Fixes: c97006046c79 ("cxl/port: Read CDAT table") Tested-by: Ira Weiny <[email protected]> Signed-off-by: Lukas Wunner <[email protected]> Reviewed-by: Dan Williams <[email protected]> Reviewed-by: Jonathan Cameron <[email protected]> Cc: [email protected] # v6.0+ Link: https://lore.kernel.org/r/bce3aebc0e8e18a1173425a7a865b232c3912963.1678543498.git.lukas@wunner.de Signed-off-by: Dan Williams <[email protected]>
2023-04-03cxl/pci: Handle truncated CDAT headerLukas Wunner1-1/+1
cxl_cdat_get_length() only checks whether the DOE response size is sufficient for the Table Access response header (1 dword), but not the succeeding CDAT header (1 dword length plus other fields). It thus returns whatever uninitialized memory happens to be on the stack if a truncated DOE response with only 1 dword was received. Fix it. Fixes: c97006046c79 ("cxl/port: Read CDAT table") Reported-by: Ming Li <[email protected]> Tested-by: Ira Weiny <[email protected]> Signed-off-by: Lukas Wunner <[email protected]> Reviewed-by: Ming Li <[email protected]> Reviewed-by: Dan Williams <[email protected]> Reviewed-by: Jonathan Cameron <[email protected]> Cc: [email protected] # v6.0+ Reviewed-by: Kuppuswamy Sathyanarayanan <[email protected]> Link: https://lore.kernel.org/r/000e69cd163461c8b1bc2cf4155b6e25402c29c7.1678543498.git.lukas@wunner.de Signed-off-by: Dan Williams <[email protected]>
2023-04-03Merge tag 'vfs.misc.fixes.v6.3-rc6' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping Pull vfs fix from Christian Brauner: "When a mount or mount tree is made shared the vfs allocates new peer group ids for all mounts that have no peer group id set. Only mounts that aren't marked with MNT_SHARED are relevant here as MNT_SHARED indicates that the mount has fully transitioned to a shared mount. The peer group id handling is done with namespace lock held. On failure, the peer group id settings of mounts for which a new peer group id was allocated need to be reverted and the allocated peer group id freed. The cleanup_group_ids() helper can identify the mounts to cleanup by checking whether a given mount has a peer group id set but isn't marked MNT_SHARED. The deallocation always needs to happen with namespace lock held to protect against concurrent modifications of the propagation settings. This fixes the one place where the namespace lock was dropped before calling cleanup_group_ids()" * tag 'vfs.misc.fixes.v6.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: fs: drop peer group ids under namespace lock
2023-04-03Merge tag 'hyperv-fixes-signed-20230402' of ↵Linus Torvalds2-4/+12
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv fixes from Wei Liu: - Fix a bug in channel allocation for VMbus (Mohammed Gamal) - Do not allow root partition functionality in CVM (Michael Kelley) * tag 'hyperv-fixes-signed-20230402' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: x86/hyperv: Block root partition functionality in a Confidential VM Drivers: vmbus: Check for channel allocation before looking up relids
2023-04-03tracing: Error if a trace event has an array for a __field()Steven Rostedt (Google)1-4/+17
A __field() in the TRACE_EVENT() macro is used to set up the fields of the trace event data. It is for single storage units (word, char, int, pointer, etc) and not for complex structures or arrays. Unfortunately, there's nothing preventing the build from accepting: __field(int, arr[5]); from building. It will turn into a array value. This use to work fine, as the offset and size use to be determined by the macro using the field name, but things have changed and the offset and size are now determined by the type. So the above would only be size 4, and the next field will be located 4 bytes from it (instead of 20). The proper way to declare static arrays is to use the __array() macro. Instead of __field(int, arr[5]) it should be __array(int, arr, 5). Add some macro tricks to the building of a trace event from the TRACE_EVENT() macro such that __field(int, arr[5]) will fail to build. A comment by the failure will explain why the build failed. Link: https://lore.kernel.org/lkml/[email protected]/ Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Reported-by: Douglas RAILLARD <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]> Acked-by: Masami Hiramatsu (Google) <[email protected]>
2023-04-03tracing/osnoise: Fix notify new tracing_max_latencyDaniel Bristot de Oliveira1-1/+1
osnoise/timerlat tracers are reporting new max latency on instances where the tracing is off, creating inconsistencies between the max reported values in the trace and in the tracing_max_latency. Thus only report new tracing_max_latency on active tracing instances. Link: https://lkml.kernel.org/r/ecd109fde4a0c24ab0f00ba1e9a144ac19a91322.1680104184.git.bristot@kernel.org Cc: [email protected] Fixes: dae181349f1e ("tracing/osnoise: Support a list of trace_array *tr") Signed-off-by: Daniel Bristot de Oliveira <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-04-03tracing/timerlat: Notify new max thread latencyDaniel Bristot de Oliveira1-0/+2
timerlat is not reporting a new tracing_max_latency for the thread latency. The reason is that it is not calling notify_new_max_latency() function after the new thread latency is sampled. Call notify_new_max_latency() after computing the thread latency. Link: https://lkml.kernel.org/r/16e18d61d69073d0192ace07bf61e405cca96e9c.1680104184.git.bristot@kernel.org Cc: [email protected] Fixes: dae181349f1e ("tracing/osnoise: Support a list of trace_array *tr") Signed-off-by: Daniel Bristot de Oliveira <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-04-03ftrace: Mark get_lock_parent_ip() __always_inlineJohn Keeping1-1/+1
If the compiler decides not to inline this function then preemption tracing will always show an IP inside the preemption disabling path and never the function actually calling preempt_{enable,disable}. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Mark Rutland <[email protected]> Cc: [email protected] Fixes: f904f58263e1d ("sched/debug: Fix preempt_disable_ip recording for preempt_disable()") Signed-off-by: John Keeping <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-04-03ring-buffer: Fix race while reader and writer are on the same pageZheng Yejian1-1/+12
When user reads file 'trace_pipe', kernel keeps printing following logs that warn at "cpu_buffer->reader_page->read > rb_page_size(reader)" in rb_get_reader_page(). It just looks like there's an infinite loop in tracing_read_pipe(). This problem occurs several times on arm64 platform when testing v5.10 and below. Call trace: rb_get_reader_page+0x248/0x1300 rb_buffer_peek+0x34/0x160 ring_buffer_peek+0xbc/0x224 peek_next_entry+0x98/0xbc __find_next_entry+0xc4/0x1c0 trace_find_next_entry_inc+0x30/0x94 tracing_read_pipe+0x198/0x304 vfs_read+0xb4/0x1e0 ksys_read+0x74/0x100 __arm64_sys_read+0x24/0x30 el0_svc_common.constprop.0+0x7c/0x1bc do_el0_svc+0x2c/0x94 el0_svc+0x20/0x30 el0_sync_handler+0xb0/0xb4 el0_sync+0x160/0x180 Then I dump the vmcore and look into the problematic per_cpu ring_buffer, I found that tail_page/commit_page/reader_page are on the same page while reader_page->read is obviously abnormal: tail_page == commit_page == reader_page == { .write = 0x100d20, .read = 0x8f9f4805, // Far greater than 0xd20, obviously abnormal!!! .entries = 0x10004c, .real_end = 0x0, .page = { .time_stamp = 0x857257416af0, .commit = 0xd20, // This page hasn't been full filled. // .data[0...0xd20] seems normal. } } The root cause is most likely the race that reader and writer are on the same page while reader saw an event that not fully committed by writer. To fix this, add memory barriers to make sure the reader can see the content of what is committed. Since commit a0fcaaed0c46 ("ring-buffer: Fix race between reset page and reading page") has added the read barrier in rb_get_reader_page(), here we just need to add the write barrier. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: [email protected] Fixes: 77ae365eca89 ("ring-buffer: make lockless") Suggested-by: Steven Rostedt (Google) <[email protected]> Signed-off-by: Zheng Yejian <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-04-03tracing/synthetic: Fix races on freeing last_cmdTze-nan Wu1-4/+15
Currently, the "last_cmd" variable can be accessed by multiple processes asynchronously when multiple users manipulate synthetic_events node at the same time, it could lead to use-after-free or double-free. This patch add "lastcmd_mutex" to prevent "last_cmd" from being accessed asynchronously. ================================================================ It's easy to reproduce in the KASAN environment by running the two scripts below in different shells. script 1: while : do echo -n -e '\x88' > /sys/kernel/tracing/synthetic_events done script 2: while : do echo -n -e '\xb0' > /sys/kernel/tracing/synthetic_events done ================================================================ double-free scenario: process A process B ------------------- --------------- 1.kstrdup last_cmd 2.free last_cmd 3.free last_cmd(double-free) ================================================================ use-after-free scenario: process A process B ------------------- --------------- 1.kstrdup last_cmd 2.free last_cmd 3.tracing_log_err(use-after-free) ================================================================ Appendix 1. KASAN report double-free: BUG: KASAN: double-free in kfree+0xdc/0x1d4 Free of addr ***** by task sh/4879 Call trace: ... kfree+0xdc/0x1d4 create_or_delete_synth_event+0x60/0x1e8 trace_parse_run_command+0x2bc/0x4b8 synth_events_write+0x20/0x30 vfs_write+0x200/0x830 ... Allocated by task 4879: ... kstrdup+0x5c/0x98 create_or_delete_synth_event+0x6c/0x1e8 trace_parse_run_command+0x2bc/0x4b8 synth_events_write+0x20/0x30 vfs_write+0x200/0x830 ... Freed by task 5464: ... kfree+0xdc/0x1d4 create_or_delete_synth_event+0x60/0x1e8 trace_parse_run_command+0x2bc/0x4b8 synth_events_write+0x20/0x30 vfs_write+0x200/0x830 ... ================================================================ Appendix 2. KASAN report use-after-free: BUG: KASAN: use-after-free in strlen+0x5c/0x7c Read of size 1 at addr ***** by task sh/5483 sh: CPU: 7 PID: 5483 Comm: sh ... __asan_report_load1_noabort+0x34/0x44 strlen+0x5c/0x7c tracing_log_err+0x60/0x444 create_or_delete_synth_event+0xc4/0x204 trace_parse_run_command+0x2bc/0x4b8 synth_events_write+0x20/0x30 vfs_write+0x200/0x830 ... Allocated by task 5483: ... kstrdup+0x5c/0x98 create_or_delete_synth_event+0x80/0x204 trace_parse_run_command+0x2bc/0x4b8 synth_events_write+0x20/0x30 vfs_write+0x200/0x830 ... Freed by task 5480: ... kfree+0xdc/0x1d4 create_or_delete_synth_event+0x74/0x204 trace_parse_run_command+0x2bc/0x4b8 synth_events_write+0x20/0x30 vfs_write+0x200/0x830 ... Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Fixes: 27c888da9867 ("tracing: Remove size restriction on synthetic event cmd error logging") Cc: [email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Matthias Brugger <[email protected]> Cc: AngeloGioacchino Del Regno <[email protected]> Cc: "Tom Zanussi" <[email protected]> Signed-off-by: Tze-nan Wu <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>