aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2024-11-12LoongArch: Fix AP booting issue in VM modeBibo Mao1-0/+15
Native IPI is used for AP booting, because it is the booting interface between OS and BIOS firmware. The paravirt IPI is only used inside OS, and native IPI is necessary to boot AP. When booting AP, we write the kernel entry address in the HW mailbox of AP and send IPI interrupt to it. AP executes idle instruction and waits for interrupts or SW events, then clears IPI interrupt and jumps to the kernel entry from HW mailbox. Between writing HW mailbox and sending IPI, AP can be woken up by SW events and jumps to the kernel entry, so ACTION_BOOT_CPU IPI interrupt will keep pending during AP booting. And native IPI interrupt handler needs be registered so that it can clear pending native IPI, else there will be endless interrupts during AP booting stage. Here native IPI interrupt is initialized even if paravirt IPI is used. Cc: stable@vger.kernel.org Fixes: 74c16b2e2b0c ("LoongArch: KVM: Add PV IPI support on guest side") Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-11-12LoongArch: Add WriteCombine shadow mapping in KASANKanglong Wang2-1/+15
Currently, the kernel couldn't boot when ARCH_IOREMAP, ARCH_WRITECOMBINE and KASAN are enabled together. Because DMW2 is used by kernel now which is configured as 0xa000000000000000 for WriteCombine, but KASAN has no segment mapping for it. This patch fix this issue. Solution: Add the relevant definitions for WriteCombine (DMW2) in KASAN. Cc: stable@vger.kernel.org Fixes: 8e02c3b782ec ("LoongArch: Add writecombine support for DMW-based ioremap()") Signed-off-by: Kanglong Wang <wangkanglong@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-11-12LoongArch: Disable KASAN if PGDIR_SIZE is too large for cpu_vabitsHuacai Chen2-3/+14
If PGDIR_SIZE is too large for cpu_vabits, KASAN_SHADOW_END will overflow UINTPTR_MAX because KASAN_SHADOW_START/KASAN_SHADOW_END are aligned up by PGDIR_SIZE. And then the overflowed KASAN_SHADOW_END looks like a user space address. For example, PGDIR_SIZE of CONFIG_4KB_4LEVEL is 2^39, which is too large for Loongson-2K series whose cpu_vabits = 39. Since CONFIG_4KB_4LEVEL is completely legal for CPUs with cpu_vabits <= 39, we just disable KASAN via early return in kasan_init(). Otherwise we get a boot failure. Moreover, we change KASAN_SHADOW_END from the first address after KASAN shadow area to the last address in KASAN shadow area, in order to avoid the end address exactly overflow to 0 (which is a legal case). We don't need to worry about alignment because pgd_addr_end() can handle it. Cc: stable@vger.kernel.org Reviewed-by: Jiaxun Yang <jiaxun.yang@flygoat.com> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-11-12LoongArch: Make KASAN work with 5-level page-tablesHuacai Chen1-3/+23
Make KASAN work with 5-level page-tables, including: 1. Implement and use __pgd_none() and kasan_p4d_offset(). 2. As done in kasan_pmd_populate() and kasan_pte_populate(), restrict the loop conditions of kasan_p4d_populate() and kasan_pud_populate() to avoid unnecessary population. Cc: stable@vger.kernel.org Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-11-12LoongArch: Define a default value for VM_DATA_DEFAULT_FLAGSYuli Wang1-4/+1
This is a trivial cleanup, commit c62da0c35d58518d ("mm/vma: define a default value for VM_DATA_DEFAULT_FLAGS") has unified default values of VM_DATA_DEFAULT_FLAGS across different platforms. Apply the same consistency to LoongArch. Suggested-by: Wentao Guan <guanwentao@uniontech.com> Signed-off-by: Yuli Wang <wangyuli@uniontech.com> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-11-12LoongArch: Fix early_numa_add_cpu() usage for FDT systemsHuacai Chen1-1/+1
early_numa_add_cpu() applies on physical CPU id rather than logical CPU id, so use cpuid instead of cpu. Cc: stable@vger.kernel.org Fixes: 3de9c42d02a79a5 ("LoongArch: Add all CPUs enabled by fdt to NUMA node 0") Reported-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-11-12LoongArch: For all possible CPUs setup logical-physical CPU mappingHuacai Chen2-29/+55
In order to support ACPI-based physical CPU hotplug, we suppose for all "possible" CPUs cpu_logical_map() can work. Because some drivers want to use cpu_logical_map() for all "possible" CPUs, while currently we only setup logical-physical mapping for "present" CPUs. This lack of mapping also causes cpu_to_node() cannot work for hot-added CPUs. All "possible" CPUs are listed in MADT, and the "present" subset is marked as ACPI_MADT_ENABLED. To setup logical-physical CPU mapping for all possible CPUs and keep present CPUs continuous in cpu_present_mask, we parse MADT twice. The first pass handles CPUs with ACPI_MADT_ENABLED and the second pass handles CPUs without ACPI_MADT_ENABLED. The global flag (cpu_enumerated) is removed because acpi_map_cpu() calls cpu_number_map() rather than set_processor_mask() now. Reported-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-11-11Merge branch 'mlx5-misc-fixes-2024-11-07'Jakub Kicinski7-15/+54
Tariq Toukan says: ==================== mlx5 misc fixes 2024-11-07 This patchset provides misc bug fixes from the team to the mlx5 core and Eth drivers. ==================== Link: https://patch.msgid.link/20241107183527.676877-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net/mlx5e: Disable loopback self-test on multi-PF netdevCarolina Jubran1-0/+4
In Multi-PF (Socket Direct) configurations, when a loopback packet is sent through one of the secondary devices, it will always be received on the primary device. This causes the loopback layer to fail in identifying the loopback packet as the devices are different. To avoid false test failures, disable the loopback self-test in Multi-PF configurations. Fixes: ed29705e4ed1 ("net/mlx5: Enable SD feature") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107183527.676877-8-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net/mlx5e: CT: Fix null-ptr-deref in add rule err flowMoshe Shemesh1-1/+1
In error flow of mlx5_tc_ct_entry_add_rule(), in case ct_rule_add() callback returns error, zone_rule->attr is used uninitiated. Fix it to use attr which has the needed pointer value. Kernel log: BUG: kernel NULL pointer dereference, address: 0000000000000110 RIP: 0010:mlx5_tc_ct_entry_add_rule+0x2b1/0x2f0 [mlx5_core] … Call Trace: <TASK> ? __die+0x20/0x70 ? page_fault_oops+0x150/0x3e0 ? exc_page_fault+0x74/0x140 ? asm_exc_page_fault+0x22/0x30 ? mlx5_tc_ct_entry_add_rule+0x2b1/0x2f0 [mlx5_core] ? mlx5_tc_ct_entry_add_rule+0x1d5/0x2f0 [mlx5_core] mlx5_tc_ct_block_flow_offload+0xc6a/0xf90 [mlx5_core] ? nf_flow_offload_tuple+0xd8/0x190 [nf_flow_table] nf_flow_offload_tuple+0xd8/0x190 [nf_flow_table] flow_offload_work_handler+0x142/0x320 [nf_flow_table] ? finish_task_switch.isra.0+0x15b/0x2b0 process_one_work+0x16c/0x320 worker_thread+0x28c/0x3a0 ? __pfx_worker_thread+0x10/0x10 kthread+0xb8/0xf0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x2d/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Fixes: 7fac5c2eced3 ("net/mlx5: CT: Avoid reusing modify header context for natted entries") Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107183527.676877-7-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net/mlx5e: clear xdp features on non-uplink representorsWilliam Tu1-1/+2
Non-uplink representor port does not support XDP. The patch clears the xdp feature by checking the net_device_ops.ndo_bpf is set or not. Verify using the netlink tool: $ tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump dev-get Representor netdev before the patch: {'ifindex': 8, 'xdp-features': {'basic', 'ndo-xmit', 'ndo-xmit-sg', 'redirect', 'rx-sg', 'xsk-zerocopy'}, 'xdp-rx-metadata-features': set(), 'xdp-zc-max-segs': 1, 'xsk-features': set()}, With the patch: {'ifindex': 8, 'xdp-features': set(), 'xdp-rx-metadata-features': set(), 'xsk-features': set()}, Fixes: 4d5ab0ad964d ("net/mlx5e: take into account device reconfiguration for xdp_features flag") Signed-off-by: William Tu <witu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107183527.676877-6-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net/mlx5e: kTLS, Fix incorrect page refcountingDragos Tatulea1-4/+4
The kTLS tx handling code is using a mix of get_page() and page_ref_inc() APIs to increment the page reference. But on the release path (mlx5e_ktls_tx_handle_resync_dump_comp()), only put_page() is used. This is an issue when using pages from large folios: the get_page() references are stored on the folio page while the page_ref_inc() references are stored directly in the given page. On release the folio page will be dereferenced too many times. This was found while doing kTLS testing with sendfile() + ZC when the served file was read from NFS on a kernel with NFS large folios support (commit 49b29a573da8 ("nfs: add support for large folios")). Fixes: 84d1bb2b139e ("net/mlx5e: kTLS, Limit DUMP wqe size") Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107183527.676877-5-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net/mlx5: fs, lock FTE when checking if activeMark Bloch1-3/+12
The referenced commits introduced a two-step process for deleting FTEs: - Lock the FTE, delete it from hardware, set the hardware deletion function to NULL and unlock the FTE. - Lock the parent flow group, delete the software copy of the FTE, and remove it from the xarray. However, this approach encounters a race condition if a rule with the same match value is added simultaneously. In this scenario, fs_core may set the hardware deletion function to NULL prematurely, causing a panic during subsequent rule deletions. To prevent this, ensure the active flag of the FTE is checked under a lock, which will prevent the fs_core layer from attaching a new steering rule to an FTE that is in the process of deletion. [ 438.967589] MOSHE: 2496 mlx5_del_flow_rules del_hw_func [ 438.968205] ------------[ cut here ]------------ [ 438.968654] refcount_t: decrement hit 0; leaking memory. [ 438.969249] WARNING: CPU: 0 PID: 8957 at lib/refcount.c:31 refcount_warn_saturate+0xfb/0x110 [ 438.970054] Modules linked in: act_mirred cls_flower act_gact sch_ingress openvswitch nsh mlx5_vdpa vringh vhost_iotlb vdpa mlx5_ib mlx5_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm ib_uverbs ib_core zram zsmalloc fuse [last unloaded: cls_flower] [ 438.973288] CPU: 0 UID: 0 PID: 8957 Comm: tc Not tainted 6.12.0-rc1+ #8 [ 438.973888] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 438.974874] RIP: 0010:refcount_warn_saturate+0xfb/0x110 [ 438.975363] Code: 40 66 3b 82 c6 05 16 e9 4d 01 01 e8 1f 7c a0 ff 0f 0b c3 cc cc cc cc 48 c7 c7 10 66 3b 82 c6 05 fd e8 4d 01 01 e8 05 7c a0 ff <0f> 0b c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 90 [ 438.976947] RSP: 0018:ffff888124a53610 EFLAGS: 00010286 [ 438.977446] RAX: 0000000000000000 RBX: ffff888119d56de0 RCX: 0000000000000000 [ 438.978090] RDX: ffff88852c828700 RSI: ffff88852c81b3c0 RDI: ffff88852c81b3c0 [ 438.978721] RBP: ffff888120fa0e88 R08: 0000000000000000 R09: ffff888124a534b0 [ 438.979353] R10: 0000000000000001 R11: 0000000000000001 R12: ffff888119d56de0 [ 438.979979] R13: ffff888120fa0ec0 R14: ffff888120fa0ee8 R15: ffff888119d56de0 [ 438.980607] FS: 00007fe6dcc0f800(0000) GS:ffff88852c800000(0000) knlGS:0000000000000000 [ 438.983984] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 438.984544] CR2: 00000000004275e0 CR3: 0000000186982001 CR4: 0000000000372eb0 [ 438.985205] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 438.985842] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 438.986507] Call Trace: [ 438.986799] <TASK> [ 438.987070] ? __warn+0x7d/0x110 [ 438.987426] ? refcount_warn_saturate+0xfb/0x110 [ 438.987877] ? report_bug+0x17d/0x190 [ 438.988261] ? prb_read_valid+0x17/0x20 [ 438.988659] ? handle_bug+0x53/0x90 [ 438.989054] ? exc_invalid_op+0x14/0x70 [ 438.989458] ? asm_exc_invalid_op+0x16/0x20 [ 438.989883] ? refcount_warn_saturate+0xfb/0x110 [ 438.990348] mlx5_del_flow_rules+0x2f7/0x340 [mlx5_core] [ 438.990932] __mlx5_eswitch_del_rule+0x49/0x170 [mlx5_core] [ 438.991519] ? mlx5_lag_is_sriov+0x3c/0x50 [mlx5_core] [ 438.992054] ? xas_load+0x9/0xb0 [ 438.992407] mlx5e_tc_rule_unoffload+0x45/0xe0 [mlx5_core] [ 438.993037] mlx5e_tc_del_fdb_flow+0x2a6/0x2e0 [mlx5_core] [ 438.993623] mlx5e_flow_put+0x29/0x60 [mlx5_core] [ 438.994161] mlx5e_delete_flower+0x261/0x390 [mlx5_core] [ 438.994728] tc_setup_cb_destroy+0xb9/0x190 [ 438.995150] fl_hw_destroy_filter+0x94/0xc0 [cls_flower] [ 438.995650] fl_change+0x11a4/0x13c0 [cls_flower] [ 438.996105] tc_new_tfilter+0x347/0xbc0 [ 438.996503] ? ___slab_alloc+0x70/0x8c0 [ 438.996929] rtnetlink_rcv_msg+0xf9/0x3e0 [ 438.997339] ? __netlink_sendskb+0x4c/0x70 [ 438.997751] ? netlink_unicast+0x286/0x2d0 [ 438.998171] ? __pfx_rtnetlink_rcv_msg+0x10/0x10 [ 438.998625] netlink_rcv_skb+0x54/0x100 [ 438.999020] netlink_unicast+0x203/0x2d0 [ 438.999421] netlink_sendmsg+0x1e4/0x420 [ 438.999820] __sock_sendmsg+0xa1/0xb0 [ 439.000203] ____sys_sendmsg+0x207/0x2a0 [ 439.000600] ? copy_msghdr_from_user+0x6d/0xa0 [ 439.001072] ___sys_sendmsg+0x80/0xc0 [ 439.001459] ? ___sys_recvmsg+0x8b/0xc0 [ 439.001848] ? generic_update_time+0x4d/0x60 [ 439.002282] __sys_sendmsg+0x51/0x90 [ 439.002658] do_syscall_64+0x50/0x110 [ 439.003040] entry_SYSCALL_64_after_hwframe+0x76/0x7e Fixes: 718ce4d601db ("net/mlx5: Consolidate update FTE for all removal changes") Fixes: cefc23554fc2 ("net/mlx5: Fix FTE cleanup") Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107183527.676877-4-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net/mlx5: Fix msix vectors to respect platform limitParav Pandit1-5/+27
The number of PCI vectors allocated by the platform (which may be fewer than requested) is currently not honored when creating the SF pool; only the PCI MSI-X capability is considered. As a result, when a platform allocates fewer vectors (in non-dynamic mode) than requested, the PF and SF pools end up with an invalid vector range. This causes incorrect SF vector accounting, which leads to the following call trace when an invalid IRQ vector is allocated. This issue is resolved by ensuring that the platform's vector limit is respected for both the SF and PF pools. Workqueue: mlx5_vhca_event0 mlx5_sf_dev_add_active_work [mlx5_core] RIP: 0010:pci_irq_vector+0x23/0x80 RSP: 0018:ffffabd5cebd7248 EFLAGS: 00010246 RAX: ffff980880e7f308 RBX: ffff9808932fb880 RCX: 0000000000000001 RDX: 00000000000001ff RSI: 0000000000000200 RDI: ffff980880e7f308 RBP: 0000000000000200 R08: 0000000000000010 R09: ffff97a9116f0860 R10: 0000000000000002 R11: 0000000000000228 R12: ffff980897cd0160 R13: 0000000000000000 R14: ffff97a920fec0c0 R15: ffffabd5cebd72d0 FS: 0000000000000000(0000) GS:ffff97c7ff9c0000(0000) knlGS:0000000000000000 ? rescuer_thread+0x350/0x350 kthread+0x11b/0x140 ? __kthread_bind_mask+0x60/0x60 ret_from_fork+0x22/0x30 mlx5_core 0000:a1:00.0: mlx5_irq_alloc:321:(pid 6781): Failed to request irq. err = -22 mlx5_core 0000:a1:00.0: mlx5_irq_alloc:321:(pid 6781): Failed to request irq. err = -22 mlx5_core.sf mlx5_core.sf.6: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 enhanced) mlx5_core.sf mlx5_core.sf.7: firmware version: 32.43.356 mlx5_core.sf mlx5_core.sf.6 enpa1s0f0s4: renamed from eth0 mlx5_core.sf mlx5_core.sf.7: Rate limit: 127 rates are supported, range: 0Mbps to 195312Mbps mlx5_core 0000:a1:00.0: mlx5_irq_alloc:321:(pid 6781): Failed to request irq. err = -22 mlx5_core 0000:a1:00.0: mlx5_irq_alloc:321:(pid 6781): Failed to request irq. err = -22 mlx5_core 0000:a1:00.0: mlx5_irq_alloc:321:(pid 6781): Failed to request irq. err = -22 Fixes: 3354822cde5a ("net/mlx5: Use dynamic msix vectors allocation") Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Amir Tzin <amirtz@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107183527.676877-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net/mlx5: E-switch, unload IB representors when unloading ETH representorsChiara Meiohas1-1/+4
IB representors depend on ETH representors, so the IB representors should not exist without the ETH ones. When unloading the ETH representors, the corresponding IB representors should be also unloaded. The commit 8d159eb2117b ("RDMA/mlx5: Use IB set_netdev and get_netdev functions") introduced the use of the ib_device_set_netdev API in IB repsresentors. ib_device_set_netdev() increments the refcount of the representor's netdev when loading an IB representor and decrements it when unloading. Without the unloading of the IB representor, the refcount of the representor's netdev remains greater than 0, preventing it from being unregistered. The patch uncovered an underlying bug where the eth representor is unloaded, without unloading the IB representor. This issue happened when using multiport E-switch and rebooting, causing the shutdown to hang when unloading the ETH representor because the refcount of the representor's netdevice was greater than 0. Call trace: unregister_netdevice: waiting for eth3 to become free. Usage count = 2 ref_tracker: eth%d@00000000661d60f7 has 1/1 users at ib_device_set_netdev+0x160/0x2d0 [ib_core] mlx5_ib_vport_rep_load+0x104/0x3f0 [mlx5_ib] mlx5_eswitch_reload_ib_reps+0xfc/0x110 [mlx5_core] mlx5_mpesw_work+0x236/0x330 [mlx5_core] process_one_work+0x169/0x320 worker_thread+0x288/0x3a0 kthread+0xb8/0xe0 ret_from_fork+0x2d/0x50 ret_from_fork_asm+0x11/0x20 Fixes: 8d159eb2117b ("RDMA/mlx5: Use IB set_netdev and get_netdev functions") Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107183527.676877-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11Merge branch 'mptcp-fix-a-couple-of-races'Jakub Kicinski1-5/+11
Paolo Abeni says: ==================== mptcp: fix a couple of races The first patch addresses a division by zero issue reported by Eric, the second one solves a similar issue found by code inspection while investigating the former. ==================== Link: https://patch.msgid.link/cover.1731060874.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11mptcp: cope racing subflow creation in mptcp_rcv_space_adjustPaolo Abeni1-1/+2
Additional active subflows - i.e. created by the in kernel path manager - are included into the subflow list before starting the 3whs. A racing recvmsg() spooling data received on an already established subflow would unconditionally call tcp_cleanup_rbuf() on all the current subflows, potentially hitting a divide by zero error on the newly created ones. Explicitly check that the subflow is in a suitable state before invoking tcp_cleanup_rbuf(). Fixes: c76c6956566f ("mptcp: call tcp_cleanup_rbuf on subflows") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/02374660836e1b52afc91966b7535c8c5f7bafb0.1731060874.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11mptcp: error out earlier on disconnectPaolo Abeni1-4/+9
Eric reported a division by zero splat in the MPTCP protocol: Oops: divide error: 0000 [#1] PREEMPT SMP KASAN PTI CPU: 1 UID: 0 PID: 6094 Comm: syz-executor317 Not tainted 6.12.0-rc5-syzkaller-00291-g05b92660cdfe #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024 RIP: 0010:__tcp_select_window+0x5b4/0x1310 net/ipv4/tcp_output.c:3163 Code: f6 44 01 e3 89 df e8 9b 75 09 f8 44 39 f3 0f 8d 11 ff ff ff e8 0d 74 09 f8 45 89 f4 e9 04 ff ff ff e8 00 74 09 f8 44 89 f0 99 <f7> 7c 24 14 41 29 d6 45 89 f4 e9 ec fe ff ff e8 e8 73 09 f8 48 89 RSP: 0018:ffffc900041f7930 EFLAGS: 00010293 RAX: 0000000000017e67 RBX: 0000000000017e67 RCX: ffffffff8983314b RDX: 0000000000000000 RSI: ffffffff898331b0 RDI: 0000000000000004 RBP: 00000000005d6000 R08: 0000000000000004 R09: 0000000000017e67 R10: 0000000000003e80 R11: 0000000000000000 R12: 0000000000003e80 R13: ffff888031d9b440 R14: 0000000000017e67 R15: 00000000002eb000 FS: 00007feb5d7f16c0(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007feb5d8adbb8 CR3: 0000000074e4c000 CR4: 00000000003526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> __tcp_cleanup_rbuf+0x3e7/0x4b0 net/ipv4/tcp.c:1493 mptcp_rcv_space_adjust net/mptcp/protocol.c:2085 [inline] mptcp_recvmsg+0x2156/0x2600 net/mptcp/protocol.c:2289 inet_recvmsg+0x469/0x6a0 net/ipv4/af_inet.c:885 sock_recvmsg_nosec net/socket.c:1051 [inline] sock_recvmsg+0x1b2/0x250 net/socket.c:1073 __sys_recvfrom+0x1a5/0x2e0 net/socket.c:2265 __do_sys_recvfrom net/socket.c:2283 [inline] __se_sys_recvfrom net/socket.c:2279 [inline] __x64_sys_recvfrom+0xe0/0x1c0 net/socket.c:2279 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7feb5d857559 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 51 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007feb5d7f1208 EFLAGS: 00000246 ORIG_RAX: 000000000000002d RAX: ffffffffffffffda RBX: 00007feb5d8e1318 RCX: 00007feb5d857559 RDX: 000000800000000e RSI: 0000000000000000 RDI: 0000000000000003 RBP: 00007feb5d8e1310 R08: 0000000000000000 R09: ffffffff81000000 R10: 0000000000000100 R11: 0000000000000246 R12: 00007feb5d8e131c R13: 00007feb5d8ae074 R14: 000000800000000e R15: 00000000fffffdef and provided a nice reproducer. The root cause is the current bad handling of racing disconnect. After the blamed commit below, sk_wait_data() can return (with error) with the underlying socket disconnected and a zero rcv_mss. Catch the error and return without performing any additional operations on the current socket. Reported-by: Eric Dumazet <edumazet@google.com> Fixes: 419ce133ab92 ("tcp: allow again tcp_disconnect() when threads are waiting") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/8c82ecf71662ecbc47bf390f9905de70884c9f2d.1731060874.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net: clarify SO_DEVMEM_DONTNEED behavior in documentationMina Almasry1-0/+9
Document new behavior when the number of frags passed is too big. Signed-off-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20241107210331.3044434-2-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net: fix SO_DEVMEM_DONTNEED looping too longMina Almasry1-18/+24
Exit early if we're freeing more than 1024 frags, to prevent looping too long. Also minor code cleanups: - Flip checks to reduce indentation. - Use sizeof(*tokens) everywhere for consistentcy. Cc: Yi Lai <yi1.lai@linux.intel.com> Signed-off-by: Mina Almasry <almasrymina@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20241107210331.3044434-1-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11nommu: pass NULL argument to vma_iter_prealloc()Hajime Tazaki1-1/+1
When deleting a vma entry from a maple tree, it has to pass NULL to vma_iter_prealloc() in order to calculate internal state of the tree, but it passed a wrong argument. As a result, nommu kernels crashed upon accessing a vma iterator, such as acct_collect() reading the size of vma entries after do_munmap(). This commit fixes this issue by passing a right argument to the preallocation call. Link: https://lkml.kernel.org/r/20241108222834.3625217-1-thehajime@gmail.com Fixes: b5df09226450 ("mm: set up vma iterator for vma_iter_prealloc() calls") Signed-off-by: Hajime Tazaki <thehajime@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11ocfs2: fix UBSAN warning in ocfs2_verify_volume()Dmitry Antipov1-4/+9
Syzbot has reported the following splat triggered by UBSAN: UBSAN: shift-out-of-bounds in fs/ocfs2/super.c:2336:10 shift exponent 32768 is too large for 32-bit type 'int' CPU: 2 UID: 0 PID: 5255 Comm: repro Not tainted 6.12.0-rc4-syzkaller-00047-gc2ee9f594da8 #0 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x241/0x360 ? __pfx_dump_stack_lvl+0x10/0x10 ? __pfx__printk+0x10/0x10 ? __asan_memset+0x23/0x50 ? lockdep_init_map_type+0xa1/0x910 __ubsan_handle_shift_out_of_bounds+0x3c8/0x420 ocfs2_fill_super+0xf9c/0x5750 ? __pfx_ocfs2_fill_super+0x10/0x10 ? __pfx_validate_chain+0x10/0x10 ? __pfx_validate_chain+0x10/0x10 ? validate_chain+0x11e/0x5920 ? __lock_acquire+0x1384/0x2050 ? __pfx_validate_chain+0x10/0x10 ? string+0x26a/0x2b0 ? widen_string+0x3a/0x310 ? string+0x26a/0x2b0 ? bdev_name+0x2b1/0x3c0 ? pointer+0x703/0x1210 ? __pfx_pointer+0x10/0x10 ? __pfx_format_decode+0x10/0x10 ? __lock_acquire+0x1384/0x2050 ? vsnprintf+0x1ccd/0x1da0 ? snprintf+0xda/0x120 ? __pfx_lock_release+0x10/0x10 ? do_raw_spin_lock+0x14f/0x370 ? __pfx_snprintf+0x10/0x10 ? set_blocksize+0x1f9/0x360 ? sb_set_blocksize+0x98/0xf0 ? setup_bdev_super+0x4e6/0x5d0 mount_bdev+0x20c/0x2d0 ? __pfx_ocfs2_fill_super+0x10/0x10 ? __pfx_mount_bdev+0x10/0x10 ? vfs_parse_fs_string+0x190/0x230 ? __pfx_vfs_parse_fs_string+0x10/0x10 legacy_get_tree+0xf0/0x190 ? __pfx_ocfs2_mount+0x10/0x10 vfs_get_tree+0x92/0x2b0 do_new_mount+0x2be/0xb40 ? __pfx_do_new_mount+0x10/0x10 __se_sys_mount+0x2d6/0x3c0 ? __pfx___se_sys_mount+0x10/0x10 ? do_syscall_64+0x100/0x230 ? __x64_sys_mount+0x20/0xc0 do_syscall_64+0xf3/0x230 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f37cae96fda Code: 48 8b 0d 51 ce 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1e ce 0c 00 f7 d8 64 89 01 48 RSP: 002b:00007fff6c1aa228 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 00007fff6c1aa240 RCX: 00007f37cae96fda RDX: 00000000200002c0 RSI: 0000000020000040 RDI: 00007fff6c1aa240 RBP: 0000000000000004 R08: 00007fff6c1aa280 R09: 0000000000000000 R10: 00000000000008c0 R11: 0000000000000206 R12: 00000000000008c0 R13: 00007fff6c1aa280 R14: 0000000000000003 R15: 0000000001000000 </TASK> For a really damaged superblock, the value of 'i_super.s_blocksize_bits' may exceed the maximum possible shift for an underlying 'int'. So add an extra check whether the aforementioned field represents the valid block size, which is 512 bytes, 1K, 2K, or 4K. Link: https://lkml.kernel.org/r/20241106092100.2661330-1-dmantipov@yandex.ru Fixes: ccd979bdbce9 ("[PATCH] OCFS2: The Second Oracle Cluster Filesystem") Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Reported-by: syzbot+56f7cd1abe4b8e475180@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=56f7cd1abe4b8e475180 Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11nilfs2: fix null-ptr-deref in block_dirty_buffer tracepointRyusuke Konishi4-6/+2
When using the "block:block_dirty_buffer" tracepoint, mark_buffer_dirty() may cause a NULL pointer dereference, or a general protection fault when KASAN is enabled. This happens because, since the tracepoint was added in mark_buffer_dirty(), it references the dev_t member bh->b_bdev->bd_dev regardless of whether the buffer head has a pointer to a block_device structure. In the current implementation, nilfs_grab_buffer(), which grabs a buffer to read (or create) a block of metadata, including b-tree node blocks, does not set the block device, but instead does so only if the buffer is not in the "uptodate" state for each of its caller block reading functions. However, if the uptodate flag is set on a folio/page, and the buffer heads are detached from it by try_to_free_buffers(), and new buffer heads are then attached by create_empty_buffers(), the uptodate flag may be restored to each buffer without the block device being set to bh->b_bdev, and mark_buffer_dirty() may be called later in that state, resulting in the bug mentioned above. Fix this issue by making nilfs_grab_buffer() always set the block device of the super block structure to the buffer head, regardless of the state of the buffer's uptodate flag. Link: https://lkml.kernel.org/r/20241106160811.3316-3-konishi.ryusuke@gmail.com Fixes: 5305cb830834 ("block: add block_{touch|dirty}_buffer tracepoint") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ubisectech Sirius <bugreport@valiantsec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11nilfs2: fix null-ptr-deref in block_touch_buffer tracepointRyusuke Konishi1-1/+0
Patch series "nilfs2: fix null-ptr-deref bugs on block tracepoints". This series fixes null pointer dereference bugs that occur when using nilfs2 and two block-related tracepoints. This patch (of 2): It has been reported that when using "block:block_touch_buffer" tracepoint, touch_buffer() called from __nilfs_get_folio_block() causes a NULL pointer dereference, or a general protection fault when KASAN is enabled. This happens because since the tracepoint was added in touch_buffer(), it references the dev_t member bh->b_bdev->bd_dev regardless of whether the buffer head has a pointer to a block_device structure. In the current implementation, the block_device structure is set after the function returns to the caller. Here, touch_buffer() is used to mark the folio/page that owns the buffer head as accessed, but the common search helper for folio/page used by the caller function was optimized to mark the folio/page as accessed when it was reimplemented a long time ago, eliminating the need to call touch_buffer() here in the first place. So this solves the issue by eliminating the touch_buffer() call itself. Link: https://lkml.kernel.org/r/20241106160811.3316-1-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/20241106160811.3316-2-konishi.ryusuke@gmail.com Fixes: 5305cb830834 ("block: add block_{touch|dirty}_buffer tracepoint") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: Ubisectech Sirius <bugreport@valiantsec.com> Closes: https://lkml.kernel.org/r/86bd3013-887e-4e38-960f-ca45c657f032.bugreport@valiantsec.com Reported-by: syzbot+9982fb8d18eba905abe2@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=9982fb8d18eba905abe2 Tested-by: syzbot+9982fb8d18eba905abe2@syzkaller.appspotmail.com Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11mm: page_alloc: move mlocked flag clearance into free_pages_prepare()Roman Gushchin2-14/+15
Syzbot reported a bad page state problem caused by a page being freed using free_page() still having a mlocked flag at free_pages_prepare() stage: BUG: Bad page state in process syz.5.504 pfn:61f45 page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x61f45 flags: 0xfff00000080204(referenced|workingset|mlocked|node=0|zone=1|lastcpupid=0x7ff) raw: 00fff00000080204 0000000000000000 dead000000000122 0000000000000000 raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set page_owner tracks the page as allocated page last allocated via order 0, migratetype Unmovable, gfp_mask 0x400dc0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), pid 8443, tgid 8442 (syz.5.504), ts 201884660643, free_ts 201499827394 set_page_owner include/linux/page_owner.h:32 [inline] post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1537 prep_new_page mm/page_alloc.c:1545 [inline] get_page_from_freelist+0x303f/0x3190 mm/page_alloc.c:3457 __alloc_pages_noprof+0x292/0x710 mm/page_alloc.c:4733 alloc_pages_mpol_noprof+0x3e8/0x680 mm/mempolicy.c:2265 kvm_coalesced_mmio_init+0x1f/0xf0 virt/kvm/coalesced_mmio.c:99 kvm_create_vm virt/kvm/kvm_main.c:1235 [inline] kvm_dev_ioctl_create_vm virt/kvm/kvm_main.c:5488 [inline] kvm_dev_ioctl+0x12dc/0x2240 virt/kvm/kvm_main.c:5530 __do_compat_sys_ioctl fs/ioctl.c:1007 [inline] __se_compat_sys_ioctl+0x510/0xc90 fs/ioctl.c:950 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0xb4/0x110 arch/x86/entry/common.c:386 do_fast_syscall_32+0x34/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e page last free pid 8399 tgid 8399 stack trace: reset_page_owner include/linux/page_owner.h:25 [inline] free_pages_prepare mm/page_alloc.c:1108 [inline] free_unref_folios+0xf12/0x18d0 mm/page_alloc.c:2686 folios_put_refs+0x76c/0x860 mm/swap.c:1007 free_pages_and_swap_cache+0x5c8/0x690 mm/swap_state.c:335 __tlb_batch_free_encoded_pages mm/mmu_gather.c:136 [inline] tlb_batch_pages_flush mm/mmu_gather.c:149 [inline] tlb_flush_mmu_free mm/mmu_gather.c:366 [inline] tlb_flush_mmu+0x3a3/0x680 mm/mmu_gather.c:373 tlb_finish_mmu+0xd4/0x200 mm/mmu_gather.c:465 exit_mmap+0x496/0xc40 mm/mmap.c:1926 __mmput+0x115/0x390 kernel/fork.c:1348 exit_mm+0x220/0x310 kernel/exit.c:571 do_exit+0x9b2/0x28e0 kernel/exit.c:926 do_group_exit+0x207/0x2c0 kernel/exit.c:1088 __do_sys_exit_group kernel/exit.c:1099 [inline] __se_sys_exit_group kernel/exit.c:1097 [inline] __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1097 x64_sys_call+0x2634/0x2640 arch/x86/include/generated/asm/syscalls_64.h:232 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Modules linked in: CPU: 0 UID: 0 PID: 8442 Comm: syz.5.504 Not tainted 6.12.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120 bad_page+0x176/0x1d0 mm/page_alloc.c:501 free_page_is_bad mm/page_alloc.c:918 [inline] free_pages_prepare mm/page_alloc.c:1100 [inline] free_unref_page+0xed0/0xf20 mm/page_alloc.c:2638 kvm_destroy_vm virt/kvm/kvm_main.c:1327 [inline] kvm_put_kvm+0xc75/0x1350 virt/kvm/kvm_main.c:1386 kvm_vcpu_release+0x54/0x60 virt/kvm/kvm_main.c:4143 __fput+0x23f/0x880 fs/file_table.c:431 task_work_run+0x24f/0x310 kernel/task_work.c:239 exit_task_work include/linux/task_work.h:43 [inline] do_exit+0xa2f/0x28e0 kernel/exit.c:939 do_group_exit+0x207/0x2c0 kernel/exit.c:1088 __do_sys_exit_group kernel/exit.c:1099 [inline] __se_sys_exit_group kernel/exit.c:1097 [inline] __ia32_sys_exit_group+0x3f/0x40 kernel/exit.c:1097 ia32_sys_call+0x2624/0x2630 arch/x86/include/generated/asm/syscalls_32.h:253 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0xb4/0x110 arch/x86/entry/common.c:386 do_fast_syscall_32+0x34/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e RIP: 0023:0xf745d579 Code: Unable to access opcode bytes at 0xf745d54f. RSP: 002b:00000000f75afd6c EFLAGS: 00000206 ORIG_RAX: 00000000000000fc RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 00000000ffffff9c RDI: 00000000f744cff4 RBP: 00000000f717ae61 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 </TASK> The problem was originally introduced by commit b109b87050df ("mm/munlock: replace clear_page_mlock() by final clearance"): it was focused on handling pagecache and anonymous memory and wasn't suitable for lower level get_page()/free_page() API's used for example by KVM, as with this reproducer. Fix it by moving the mlocked flag clearance down to free_page_prepare(). The bug itself if fairly old and harmless (aside from generating these warnings), aside from a small memory leak - "bad" pages are stopped from being allocated again. Link: https://lkml.kernel.org/r/20241106195354.270757-1-roman.gushchin@linux.dev Fixes: b109b87050df ("mm/munlock: replace clear_page_mlock() by final clearance") Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Reported-by: syzbot+e985d3026c4fd041578e@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/6729f475.050a0220.701a.0019.GAE@google.com Acked-by: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Sean Christopherson <seanjc@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11net: fix data-races around sk->sk_forward_allocWang Liang2-4/+2
Syzkaller reported this warning: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 16 at net/ipv4/af_inet.c:156 inet_sock_destruct+0x1c5/0x1e0 Modules linked in: CPU: 0 UID: 0 PID: 16 Comm: ksoftirqd/0 Not tainted 6.12.0-rc5 #26 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 RIP: 0010:inet_sock_destruct+0x1c5/0x1e0 Code: 24 12 4c 89 e2 5b 48 c7 c7 98 ec bb 82 41 5c e9 d1 18 17 ff 4c 89 e6 5b 48 c7 c7 d0 ec bb 82 41 5c e9 bf 18 17 ff 0f 0b eb 83 <0f> 0b eb 97 0f 0b eb 87 0f 0b e9 68 ff ff ff 66 66 2e 0f 1f 84 00 RSP: 0018:ffffc9000008bd90 EFLAGS: 00010206 RAX: 0000000000000300 RBX: ffff88810b172a90 RCX: 0000000000000007 RDX: 0000000000000002 RSI: 0000000000000300 RDI: ffff88810b172a00 RBP: ffff88810b172a00 R08: ffff888104273c00 R09: 0000000000100007 R10: 0000000000020000 R11: 0000000000000006 R12: ffff88810b172a00 R13: 0000000000000004 R14: 0000000000000000 R15: ffff888237c31f78 FS: 0000000000000000(0000) GS:ffff888237c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffc63fecac8 CR3: 000000000342e000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? __warn+0x88/0x130 ? inet_sock_destruct+0x1c5/0x1e0 ? report_bug+0x18e/0x1a0 ? handle_bug+0x53/0x90 ? exc_invalid_op+0x18/0x70 ? asm_exc_invalid_op+0x1a/0x20 ? inet_sock_destruct+0x1c5/0x1e0 __sk_destruct+0x2a/0x200 rcu_do_batch+0x1aa/0x530 ? rcu_do_batch+0x13b/0x530 rcu_core+0x159/0x2f0 handle_softirqs+0xd3/0x2b0 ? __pfx_smpboot_thread_fn+0x10/0x10 run_ksoftirqd+0x25/0x30 smpboot_thread_fn+0xdd/0x1d0 kthread+0xd3/0x100 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x34/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> ---[ end trace 0000000000000000 ]--- Its possible that two threads call tcp_v6_do_rcv()/sk_forward_alloc_add() concurrently when sk->sk_state == TCP_LISTEN with sk->sk_lock unlocked, which triggers a data-race around sk->sk_forward_alloc: tcp_v6_rcv tcp_v6_do_rcv skb_clone_and_charge_r sk_rmem_schedule __sk_mem_schedule sk_forward_alloc_add() skb_set_owner_r sk_mem_charge sk_forward_alloc_add() __kfree_skb skb_release_all skb_release_head_state sock_rfree sk_mem_uncharge sk_forward_alloc_add() sk_mem_reclaim // set local var reclaimable __sk_mem_reclaim sk_forward_alloc_add() In this syzkaller testcase, two threads call tcp_v6_do_rcv() with skb->truesize=768, the sk_forward_alloc changes like this: (cpu 1) | (cpu 2) | sk_forward_alloc ... | ... | 0 __sk_mem_schedule() | | +4096 = 4096 | __sk_mem_schedule() | +4096 = 8192 sk_mem_charge() | | -768 = 7424 | sk_mem_charge() | -768 = 6656 ... | ... | sk_mem_uncharge() | | +768 = 7424 reclaimable=7424 | | | sk_mem_uncharge() | +768 = 8192 | reclaimable=8192 | __sk_mem_reclaim() | | -4096 = 4096 | __sk_mem_reclaim() | -8192 = -4096 != 0 The skb_clone_and_charge_r() should not be called in tcp_v6_do_rcv() when sk->sk_state is TCP_LISTEN, it happens later in tcp_v6_syn_recv_sock(). Fix the same issue in dccp_v6_do_rcv(). Suggested-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets") Signed-off-by: Wang Liang <wangliang74@huawei.com> Link: https://patch.msgid.link/20241107023405.889239-1-wangliang74@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11Merge tag 'sched_ext-for-6.12-rc7-fixes' of ↵Linus Torvalds4-22/+44
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - The fair sched class currently has a bug where its balance() returns true telling the sched core that it has tasks to run but then NULL from pick_task(). This makes sched core call sched_ext's pick_task() without preceding balance() which can lead to stalls in partial mode. For now, work around by detecting the condition and forcing the CPU to go through another scheduling cycle. - Add a missing newline to an error message and fix drgn introspection tool which went out of sync. * tag 'sched_ext-for-6.12-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx() sched_ext: Update scx_show_state.py to match scx_ops_bypass_depth's new type sched_ext: Add a missing newline at the end of an error message
2024-11-11Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds9-42/+45
Pull virtio fixes from Michael Tsirkin: "Several small bugfixes all over the place" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: vdpa/mlx5: Fix error path during device add vp_vdpa: fix id_table array not null terminated error virtio_pci: Fix admin vq cleanup by using correct info pointer vDPA/ifcvf: Fix pci_read_config_byte() return code handling Fix typo in vringh_test.c vdpa: solidrun: Fix UB bug with devres vsock/virtio: Initialization of the dangling pointer occurring in vsk->trans
2024-11-11dm-cache: fix warnings about duplicate slab cachesMikulas Patocka3-24/+34
The commit 4c39529663b9 adds a warning about duplicate cache names if CONFIG_DEBUG_VM is selected. These warnings are triggered by the dm-cache code. The dm-cache code allocates a slab cache for each device. This commit changes it to allocate just one slab cache in the module init function. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: 4c39529663b9 ("slab: Warn on duplicate cache names when DEBUG_VM=y")
2024-11-11dm-bufio: fix warnings about duplicate slab cachesMikulas Patocka1-4/+8
The commit 4c39529663b9 adds a warning about duplicate cache names if CONFIG_DEBUG_VM is selected. These warnings are triggered by the dm-bufio code. The dm-bufio code allocates a slab cache with each client. It is not possible to preallocate the caches in the module init function because the size of auxiliary per-buffer data is not known at this point. So, this commit changes dm-bufio so that it appends a unique atomic value to the cache name, to avoid the warnings. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: 4c39529663b9 ("slab: Warn on duplicate cache names when DEBUG_VM=y")
2024-11-11cpufreq: intel_pstate: Rearrange locking in hybrid_init_cpu_capacity_scaling()Rafael J. Wysocki1-19/+16
Notice that hybrid_init_cpu_capacity_scaling() only needs to hold hybrid_capacity_lock around __hybrid_init_cpu_capacity_scaling() calls, so introduce a "locked" wrapper around the latter and call it from the former. This allows to drop a local variable and a label that are not needed any more. Also, rename __hybrid_init_cpu_capacity_scaling() to __hybrid_refresh_cpu_capacity_scaling() for consistency. Interestingly enough, this fixes a locking issue introduced by commit 929ebc93ccaa ("cpufreq: intel_pstate: Set asymmetric CPU capacity on hybrid systems") that put an arch_enable_hybrid_capacity_scale() call under hybrid_capacity_lock, which was a mistake because the latter is acquired in CPU hotplug paths and so it cannot be held around cpus_read_lock() calls. Link: https://lore.kernel.org/linux-pm/SJ1PR11MB6129EDBF22F8A90FC3A3EDC8B9582@SJ1PR11MB6129.namprd11.prod.outlook.com/ Fixes: 929ebc93ccaa ("cpufreq: intel_pstate: Set asymmetric CPU capacity on hybrid systems") Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reported-by: "Borah, Chaitanya Kumar" <chaitanya.kumar.borah@intel.com> Link: https://patch.msgid.link/12554508.O9o76ZdvQC@rjwysocki.net [ rjw: Changelog update ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-11-11mm: count zeromap read and set for swapout and swapinBarry Song7-8/+43
When the proportion of folios from the zeromap is small, missing their accounting may not significantly impact profiling. However, it's easy to construct a scenario where this becomes an issue—for example, allocating 1 GB of memory, writing zeros from userspace, followed by MADV_PAGEOUT, and then swapping it back in. In this case, the swap-out and swap-in counts seem to vanish into a black hole, potentially causing semantic ambiguity. On the other hand, Usama reported that zero-filled pages can exceed 10% in workloads utilizing zswap, while Hailong noted that some app in Android have more than 6% zero-filled pages. Before commit 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap"), both zswap and zRAM implemented similar optimizations, leading to these optimized-out pages being counted in either zswap or zRAM counters (with pswpin/pswpout also increasing for zRAM). With zeromap functioning prior to both zswap and zRAM, userspace will no longer detect these swap-out and swap-in actions. We have three ways to address this: 1. Introduce a dedicated counter specifically for the zeromap. 2. Use pswpin/pswpout accounting, treating the zero map as a standard backend. This approach aligns with zRAM's current handling of same-page fills at the device level. However, it would mean losing the optimized-out page counters previously available in zRAM and would not align with systems using zswap. Additionally, as noted by Nhat Pham, pswpin/pswpout counters apply only to I/O done directly to the backend device. 3. Count zeromap pages under zswap, aligning with system behavior when zswap is enabled. However, this would not be consistent with zRAM, nor would it align with systems lacking both zswap and zRAM. Given the complications with options 2 and 3, this patch selects option 1. We can find these counters from /proc/vmstat (counters for the whole system) and memcg's memory.stat (counters for the interested memcg). For example: $ grep -E 'swpin_zero|swpout_zero' /proc/vmstat swpin_zero 1648 swpout_zero 33536 $ grep -E 'swpin_zero|swpout_zero' /sys/fs/cgroup/system.slice/memory.stat swpin_zero 3905 swpout_zero 3985 This patch does not address any specific zeromap bug, but the missing swpout and swpin counts for zero-filled pages can be highly confusing and may mislead user-space agents that rely on changes in these counters as indicators. Therefore, we add a Fixes tag to encourage the inclusion of this counter in any kernel versions with zeromap. Many thanks to Kanchana for the contribution of changing count_objcg_event() to count_objcg_events() to support large folios[1], which has now been incorporated into this patch. [1] https://lkml.kernel.org/r/20241001053222.6944-5-kanchana.p.sridhar@intel.com Link: https://lkml.kernel.org/r/20241107011246.59137-1-21cnbao@gmail.com Fixes: 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap") Co-developed-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Hailong Liu <hailong.liu@oppo.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Andi Kleen <ak@linux.intel.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kairui Song <kasong@tencent.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11bcachefs: Allow for unknown key types in backpointers fsckKent Overstreet1-4/+6
We can't assume that btrees only contain keys of a given type - even if they only have a single key type listed in the allowed key types for that btree; this is a forwards compatibility issue. Reported-by: syzbot+a27c3aaa3640dd3e1dfb@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-11bcachefs: Fix assertion pop in topology repairKent Overstreet2-2/+3
Fixes: baefd3f849ed ("bcachefs: btree_cache.freeable list fixes") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-10Linux 6.12-rc7Linus Torvalds1-1/+1
2024-11-10Merge tag 'clk-fixes-for-linus' of ↵Linus Torvalds3-9/+9
git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux Pull clk fixes from Stephen Boyd: "A handful of Qualcomm clk driver fixes: - Correct flags for X Elite USB MP GDSC and pcie pipediv2 clocks - Fix alpha PLL post_div mask for the cases where width is not specified - Avoid hangs in the SM8350 video driver (venus) by setting HW_CTRL trigger feature on the video clocks" * tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: clk: qcom: gcc-x1e80100: Fix USB MP SS1 PHY GDSC pwrsts flags clk: qcom: gcc-x1e80100: Fix halt_check for pipediv2 clocks clk: qcom: clk-alpha-pll: Fix pll post div mask when width is not set clk: qcom: videocc-sm8350: use HW_CTRL_TRIGGER for vcodec GDSCs
2024-11-10Merge tag 'i2c-for-6.12-rc7' of ↵Linus Torvalds3-4/+7
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c fixes from Wolfram Sang: "i2c-host fixes for v6.12-rc7 (from Andi): - Fix designware incorrect behavior when concluding a transmission - Fix Mule multiplexer error value evaluation" * tag 'i2c-for-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: i2c: designware: do not hold SCL low when I2C_DYNAMIC_TAR_UPDATE is not set i2c: muxes: Fix return value check in mule_i2c_mux_probe()
2024-11-10filemap: Fix bounds checking in filemap_read()Trond Myklebust1-1/+1
If the caller supplies an iocb->ki_pos value that is close to the filesystem upper limit, and an iterator with a count that causes us to overflow that limit, then filemap_read() enters an infinite loop. This behaviour was discovered when testing xfstests generic/525 with the "localio" optimisation for loopback NFS mounts. Reported-by: Mike Snitzer <snitzer@kernel.org> Fixes: c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()") Tested-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-11-10Merge tag 'irq_urgent_for_v6.12_rc7' of ↵Linus Torvalds1-0/+7
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fix from Borislav Petkov: - Make sure GICv3 controller interrupt activation doesn't race with a concurrent deactivation due to propagation delays of the register write * tag 'irq_urgent_for_v6.12_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/gic-v3: Force propagation of the active state with a read-back
2024-11-10Merge tag 'mm-hotfixes-stable-2024-11-09-22-40' of ↵Linus Torvalds30-172/+329
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "20 hotfixes, 14 of which are cc:stable. Three affect DAMON. Lorenzo's five-patch series to address the mmap_region error handling is here also. Apart from that, various singletons" * tag 'mm-hotfixes-stable-2024-11-09-22-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mailmap: add entry for Thorsten Blum ocfs2: remove entry once instead of null-ptr-dereference in ocfs2_xa_remove() signal: restore the override_rlimit logic fs/proc: fix compile warning about variable 'vmcore_mmap_ops' ucounts: fix counter leak in inc_rlimit_get_ucounts() selftests: hugetlb_dio: check for initial conditions to skip in the start mm: fix docs for the kernel parameter ``thp_anon=`` mm/damon/core: avoid overflow in damon_feed_loop_next_input() mm/damon/core: handle zero schemes apply interval mm/damon/core: handle zero {aggregation,ops_update} intervals mm/mlock: set the correct prev on failure objpool: fix to make percpu slot allocation more robust mm/page_alloc: keep track of free highatomic mm: resolve faulty mmap_region() error path behaviour mm: refactor arch_calc_vm_flag_bits() and arm64 MTE handling mm: refactor map_deny_write_exec() mm: unconditionally close VMAs on error mm: avoid unsafe VMA hook invocation when error arises on mmap hook mm/thp: fix deferred split unqueue naming and locking mm/thp: fix deferred split queue not partially_mapped
2024-11-10Merge tag 'usb-6.12-rc7' of ↵Linus Torvalds9-24/+33
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB/Thunderbolt fixes from Greg KH: "Here are some small remaining USB and Thunderbolt fixes and device ids for 6.12-rc7. Included in here are: - new USB serial driver device ids - thunderbolt driver fixes for reported problems - typec bugfixes - dwc3 driver fix - musb driver fix All of these have been in linux-next this past week with no reported issues" * tag 'usb-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: USB: serial: qcserial: add support for Sierra Wireless EM86xx thunderbolt: Fix connection issue with Pluggable UD-4VPD dock usb: typec: fix potential out of bounds in ucsi_ccg_update_set_new_cam_cmd() usb: dwc3: fix fault at system suspend if device was already runtime suspended usb: typec: qcom-pmic: init value of hdr_len/txbuf_len earlier usb: musb: sunxi: Fix accessing an released usb phy USB: serial: io_edgeport: fix use after free in debug printk USB: serial: option: add Quectel RG650V USB: serial: option: add Fibocom FG132 0x0112 composition thunderbolt: Add only on-board retimers when !CONFIG_USB4_DEBUGFS_MARGINING
2024-11-10Merge tag 'staging-6.12-rc7' of ↵Linus Torvalds1-4/+2
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging Pull staging driver fixes from Greg KH: "Here are two small memory leak fixes for the vchiq_arm staging driver that have been sitting in my tree for weeks and should get merged for 6.12-rc7 so that people don't keep tripping over them. They both have been in linux-next for a while with no reported problems" * tag 'staging-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: staging: vchiq_arm: Use devm_kzalloc() for drv_mgmt allocation staging: vchiq_arm: Use devm_kzalloc() for vchiq_arm_state allocation
2024-11-09selftests: net: add netlink-dumps to .gitignoreJakub Kicinski1-0/+1
Commit 55d42a0c3f9c ("selftests: net: add a test for closing a netlink socket ith dump in progress") added a new test but did not add it to gitignore. Reviewed-by: Joe Damato <jdamato@fastly.com> Link: https://patch.msgid.link/20241108004731.2979878-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-09Merge tag 'i2c-host-fixes-6.12-rc7' of ↵Wolfram Sang3-4/+7
git://git.kernel.org/pub/scm/linux/kernel/git/andi.shyti/linux into i2c/for-current i2c-host fixes for v6.12-rc7 In designware an incorrect behavior has been fixes when concluding a transmission. Fixed return error value evaluation in the Mule multiplexer.
2024-11-09net: vertexcom: mse102x: Fix tx_bytes calculationStefan Wahren1-1/+3
The tx_bytes should consider the actual size of the Ethernet frames without the SPI encapsulation. But we still need to take care of Ethernet padding. Fixes: 2f207cbf0dd4 ("net: vertexcom: Add MSE102x SPI support") Signed-off-by: Stefan Wahren <wahrenst@gmx.net> Link: https://patch.msgid.link/20241108114343.6174-3-wahrenst@gmx.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-09Merge tag 'nfsd-6.12-4' of ↵Linus Torvalds1-8/+5
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fix from Chuck Lever: - Fix a v6.12-rc regression when exporting ext4 filesystems with NFSD * tag 'nfsd-6.12-4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: NFSD: Fix READDIR on NFSv3 mounts of ext4 exports
2024-11-09Merge tag 'v6.12-rc6-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds1-3/+11
Pull smb client fix from Steve French: "Fix net namespace refcount use after free issue" * tag 'v6.12-rc6-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6: smb: client: Fix use-after-free of network namespace.
2024-11-09Merge tag 'block-6.12-20241108' of git://git.kernel.dk/linuxLinus Torvalds1-7/+14
Pull block fix from Jens Axboe: "Single fix for an issue triggered with PROVE_RCU=y, with nvme using the wrong iterators for an SRCU protected list" * tag 'block-6.12-20241108' of git://git.kernel.dk/linux: nvme/host: Fix RCU list traversal to use SRCU primitive
2024-11-09sched_ext: Handle cases where pick_task_scx() is called without preceding ↵Tejun Heo3-20/+42
balance_scx() sched_ext dispatches tasks from the BPF scheduler from balance_scx() and thus every pick_task_scx() call must be preceded by balance_scx(). While this usually holds, due to a bug, there are cases where the fair class's balance() returns true indicating that it has tasks to run on the CPU and thus terminating balance() calls but fails to actually find the next task to run when pick_task() is called. In such cases, pick_task_scx() can be called without preceding balance_scx(). Detect this condition using SCX_RQ_BAL_PENDING flags. If detected, keep running the previous task if possible and avoid stalling from entering idle without balancing. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/Ztj_h5c2LYsdXYbA@slm.duckdns.org
2024-11-09landlock: Optimize scope enforcementMickaël Salaün1-3/+15
Do not walk through the domain hierarchy when the required scope is not supported by this domain. This is the same approach as for filesystem and network restrictions. Cc: Mikhail Ivanov <ivanov.mikhail1@huawei-partners.com> Cc: Tahera Fahimi <fahimitahera@gmail.com> Reviewed-by: Günther Noack <gnoack@google.com> Link: https://lore.kernel.org/r/20241109110856.222842-4-mic@digikod.net Signed-off-by: Mickaël Salaün <mic@digikod.net>