aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2022-11-22sfc: fix potential memleak in __ef100_hard_start_xmit()Zhang Changzhong1-0/+1
The __ef100_hard_start_xmit() returns NETDEV_TX_OK without freeing skb in error handling case, add dev_kfree_skb_any() to fix it. Fixes: 51b35a454efd ("sfc: skeleton EF100 PF driver") Signed-off-by: Zhang Changzhong <[email protected]> Acked-by: Martin Habets <[email protected]> Reviewed-by: Leon Romanovsky <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
2022-11-22net: wwan: iosm: use ACPI_FREE() but not kfree() in ipc_pcie_read_bios_cfg()Wang ShaoBo1-1/+1
acpi_evaluate_dsm() should be coupled with ACPI_FREE() to free the ACPI memory, because we need to track the allocation of acpi_object when ACPI_DBG_TRACK_ALLOCATIONS enabled, so use ACPI_FREE() instead of kfree(). Fixes: d38a648d2d6c ("net: wwan: iosm: fix memory leak in ipc_pcie_read_bios_cfg") Signed-off-by: Wang ShaoBo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
2022-11-22Merge tag 'gvt-fixes-2022-11-11' of https://github.com/intel/gvt-linux into ↵Tvrtko Ursulin1-5/+3
drm-intel-fixes gvt-fixes-2022-11-11 - kvm reference fix from Sean Signed-off-by: Tvrtko Ursulin <[email protected]> From: Zhenyu Wang <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2022-11-22xfrm: Fix ignored return value in xfrm6_init()Chen Zhongjin1-1/+5
When IPv6 module initializing in xfrm6_init(), register_pernet_subsys() is possible to fail but its return value is ignored. If IPv6 initialization fails later and xfrm6_fini() is called, removing uninitialized list in xfrm6_net_ops will cause null-ptr-deref: KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f] CPU: 1 PID: 330 Comm: insmod RIP: 0010:unregister_pernet_operations+0xc9/0x450 Call Trace: <TASK> unregister_pernet_subsys+0x31/0x3e xfrm6_fini+0x16/0x30 [ipv6] ip6_route_init+0xcd/0x128 [ipv6] inet6_init+0x29c/0x602 [ipv6] ... Fix it by catching the error return value of register_pernet_subsys(). Fixes: 8d068875caca ("xfrm: make gc_thresh configurable in all namespaces") Signed-off-by: Chen Zhongjin <[email protected]> Reviewed-by: Leon Romanovsky <[email protected]> Signed-off-by: Steffen Klassert <[email protected]>
2022-11-22xfrm: Fix oops in __xfrm_state_delete()Thomas Jarosch1-1/+1
Kernel 5.14 added a new "byseq" index to speed up xfrm_state lookups by sequence number in commit fe9f1d8779cb ("xfrm: add state hashtable keyed by seq") While the patch was thorough, the function pfkey_send_new_mapping() in net/af_key.c also modifies x->km.seq and never added the current xfrm_state to the "byseq" index. This leads to the following kernel Ooops: BUG: kernel NULL pointer dereference, address: 0000000000000000 .. RIP: 0010:__xfrm_state_delete+0xc9/0x1c0 .. Call Trace: <TASK> xfrm_state_delete+0x1e/0x40 xfrm_del_sa+0xb0/0x110 [xfrm_user] xfrm_user_rcv_msg+0x12d/0x270 [xfrm_user] ? remove_entity_load_avg+0x8a/0xa0 ? copy_to_user_state_extra+0x580/0x580 [xfrm_user] netlink_rcv_skb+0x51/0x100 xfrm_netlink_rcv+0x30/0x50 [xfrm_user] netlink_unicast+0x1a6/0x270 netlink_sendmsg+0x22a/0x480 __sys_sendto+0x1a6/0x1c0 ? __audit_syscall_entry+0xd8/0x130 ? __audit_syscall_exit+0x249/0x2b0 __x64_sys_sendto+0x23/0x30 do_syscall_64+0x3a/0x90 entry_SYSCALL_64_after_hwframe+0x61/0xcb Exact location of the crash in __xfrm_state_delete(): if (x->km.seq) hlist_del_rcu(&x->byseq); The hlist_node "byseq" was never populated. The bug only triggers if a new NAT traversal mapping (changed IP or port) is detected in esp_input_done2() / esp6_input_done2(), which in turn indirectly calls pfkey_send_new_mapping() *if* the kernel is compiled with CONFIG_NET_KEY and "af_key" is active. The PF_KEYv2 message SADB_X_NAT_T_NEW_MAPPING is not part of RFC 2367. Various implementations have been examined how they handle the "sadb_msg_seq" header field: - racoon (Android): does not process SADB_X_NAT_T_NEW_MAPPING - strongswan: does not care about sadb_msg_seq - openswan: does not care about sadb_msg_seq There is no standard how PF_KEYv2 sadb_msg_seq should be populated for SADB_X_NAT_T_NEW_MAPPING and it's not used in popular implementations either. Herbert Xu suggested we should just use the current km.seq value as is. This fixes the root cause of the oops since we no longer modify km.seq itself. The update of "km.seq" looks like a copy'n'paste error from pfkey_send_acquire(). SADB_ACQUIRE must indeed assign a unique km.seq number according to RFC 2367. It has been verified that code paths involving pfkey_send_acquire() don't cause the same Oops. PF_KEYv2 SADB_X_NAT_T_NEW_MAPPING support was originally added here: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git commit cbc3488685b20e7b2a98ad387a1a816aada569d8 Author: Derek Atkins <[email protected]> AuthorDate: Wed Apr 2 13:21:02 2003 -0800 [IPSEC]: Implement UDP Encapsulation framework. In particular, implement ESPinUDP encapsulation for IPsec Nat Traversal. A note on triggering the bug: I was not able to trigger it using VMs. There is one VPN using a high latency link on our production VPN server that triggered it like once a day though. Link: https://github.com/strongswan/strongswan/issues/992 Link: https://lore.kernel.org/netdev/00959f33ee52c4b3b0084d42c430418e502db554.1652340703.git.antony.antony@secunet.com/T/ Link: https://lore.kernel.org/netdev/[email protected]/T/ Fixes: fe9f1d8779cb ("xfrm: add state hashtable keyed by seq") Reported-by: Roth Mark <[email protected]> Reported-by: Zhihao Chen <[email protected]> Tested-by: Roth Mark <[email protected]> Signed-off-by: Thomas Jarosch <[email protected]> Acked-by: Antony Antony <[email protected]> Acked-by: Herbert Xu <[email protected]> Signed-off-by: Steffen Klassert <[email protected]>
2022-11-22zonefs: Fix race between modprobe and mountZhang Xiaoxu1-6/+6
There is a race between modprobe and mount as below: modprobe zonefs | mount -t zonefs --------------------------------|------------------------- zonefs_init | register_filesystem [1] | | zonefs_fill_super [2] zonefs_sysfs_init [3] | 1. register zonefs suceess, then 2. user can mount the zonefs 3. if sysfs initialize failed, the module initialize failed. Then the mount process maybe some error happened since the module initialize failed. Let's register zonefs after all dependency resource ready. And reorder the dependency resource release in module exit. Fixes: 9277a6d4fbd4 ("zonefs: Export open zone resource information through sysfs") Signed-off-by: Zhang Xiaoxu <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Signed-off-by: Damien Le Moal <[email protected]>
2022-11-21ice: fix handling of burst Tx timestampsJacob Keller2-16/+16
Commit 1229b33973c7 ("ice: Add low latency Tx timestamp read") refactored PTP timestamping logic to use a threaded IRQ instead of a separate kthread. This implementation introduced ice_misc_intr_thread_fn and redefined the ice_ptp_process_ts function interface to return a value of whether or not the timestamp processing was complete. ice_misc_intr_thread_fn would take the return value from ice_ptp_process_ts and convert it into either IRQ_HANDLED if there were no more timestamps to be processed, or IRQ_WAKE_THREAD if the thread should continue processing. This is not correct, as the kernel does not re-schedule threaded IRQ functions automatically. IRQ_WAKE_THREAD can only be used by the main IRQ function. This results in the ice_ptp_process_ts function (and in turn the ice_ptp_tx_tstamp function) from only being called exactly once per interrupt. If an application sends a burst of Tx timestamps without waiting for a response, the interrupt will trigger for the first timestamp. However, later timestamps may not have arrived yet. This can result in dropped or discarded timestamps. Worse, on E822 hardware this results in the interrupt logic getting stuck such that no future interrupts will be triggered. The result is complete loss of Tx timestamp functionality. Fix this by modifying the ice_misc_intr_thread_fn to perform its own polling of the ice_ptp_process_ts function. We sleep for a few microseconds between attempts to avoid wasting significant CPU time. The value was chosen to allow time for the Tx timestamps to complete without wasting so much time that we overrun application wait budgets in the worst case. The ice_ptp_process_ts function also currently returns false in the event that the Tx tracker is not initialized. This would result in the threaded IRQ handler never exiting if it gets started while the tracker is not initialized. Fix the function to appropriately return true when the tracker is not initialized. Note that this will not reproduce with default ptp4l behavior, as the program always synchronously waits for a timestamp response before sending another timestamp request. Reported-by: Siddaraju DH <[email protected]> Fixes: 1229b33973c7 ("ice: Add low latency Tx timestamp read") Signed-off-by: Jacob Keller <[email protected]> Tested-by: Gurucharan G <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2022-11-21tipc: check skb_linearize() return value in tipc_disc_rcv()YueHaibing1-1/+4
If skb_linearize() fails in tipc_disc_rcv(), we need to free the skb instead of handle it. Fixes: 25b0b9c4e835 ("tipc: handle collisions of 32-bit node address hash values") Signed-off-by: YueHaibing <[email protected]> Acked-by: Jon Maloy <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2022-11-21Merge branch '40GbE' of ↵Jakub Kicinski2-19/+23
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2022-11-18 (iavf) Ivan Vecera resolves issues related to reset by adding back call to netif_tx_stop_all_queues() and adding calls to dev_close() to ensure device is properly closed during reset. Stefan Assmann removes waiting for setting of MAC address as this breaks ARP. Slawomir adds setting of __IAVF_IN_REMOVE_TASK bit to prevent deadlock between remove and shutdown. * '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue: iavf: Fix race condition between iavf_shutdown and iavf_remove iavf: remove INITIAL_MAC_SET to allow gARP to work properly iavf: Do not restart Tx queues after reset task failure iavf: Fix a crash during reset task ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2022-11-21Merge branch 'tipc-fix-two-race-issues-in-tipc_conn_alloc'Jakub Kicinski1-9/+11
Xin Long says: ==================== tipc: fix two race issues in tipc_conn_alloc The race exists beteen tipc_topsrv_accept() and tipc_conn_close(), one is allocating the con while the other is freeing it and there is no proper lock protecting it. Therefore, a null-pointer-defer and a use-after-free may be triggered, see details on each patch. ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2022-11-21tipc: add an extra conn_get in tipc_conn_allocXin Long1-3/+6
One extra conn_get() is needed in tipc_conn_alloc(), as after tipc_conn_alloc() is called, tipc_conn_close() may free this con before deferencing it in tipc_topsrv_accept(): tipc_conn_alloc(); newsk = newsock->sk; <---- tipc_conn_close(); write_lock_bh(&sk->sk_callback_lock); newsk->sk_data_ready = tipc_conn_data_ready; Then an uaf issue can be triggered: BUG: KASAN: use-after-free in tipc_topsrv_accept+0x1e7/0x370 [tipc] Call Trace: <TASK> dump_stack_lvl+0x33/0x46 print_report+0x178/0x4b0 kasan_report+0x8c/0x100 kasan_check_range+0x179/0x1e0 tipc_topsrv_accept+0x1e7/0x370 [tipc] process_one_work+0x6a3/0x1030 worker_thread+0x8a/0xdf0 This patch fixes it by holding it in tipc_conn_alloc(), then after all accessing in tipc_topsrv_accept() releasing it. Note when does this in tipc_topsrv_kern_subscr(), as tipc_conn_rcv_sub() returns 0 or -1 only, we don't need to check for "> 0". Fixes: c5fa7b3cf3cb ("tipc: introduce new TIPC server infrastructure") Signed-off-by: Xin Long <[email protected]> Acked-by: Jon Maloy <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2022-11-21tipc: set con sock in tipc_conn_allocXin Long1-6/+5
A crash was reported by Wei Chen: BUG: kernel NULL pointer dereference, address: 0000000000000018 RIP: 0010:tipc_conn_close+0x12/0x100 Call Trace: tipc_topsrv_exit_net+0x139/0x320 ops_exit_list.isra.9+0x49/0x80 cleanup_net+0x31a/0x540 process_one_work+0x3fa/0x9f0 worker_thread+0x42/0x5c0 It was caused by !con->sock in tipc_conn_close(). In tipc_topsrv_accept(), con is allocated in conn_idr then its sock is set: con = tipc_conn_alloc(); ... <----[1] con->sock = newsock; If tipc_conn_close() is called in anytime of [1], the null-pointer-def is triggered by con->sock->sk due to con->sock is not yet set. This patch fixes it by moving the con->sock setting to tipc_conn_alloc() under s->idr_lock. So that con->sock can never be NULL when getting the con from s->conn_idr. It will be also safer to move con->server and flag CF_CONNECTED setting under s->idr_lock, as they should all be set before tipc_conn_alloc() is called. Fixes: c5fa7b3cf3cb ("tipc: introduce new TIPC server infrastructure") Reported-by: Wei Chen <[email protected]> Signed-off-by: Xin Long <[email protected]> Acked-by: Jon Maloy <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2022-11-21net: phy: at803x: fix error return code in at803x_probe()Wei Yongjun1-1/+3
Fix to return a negative error code from the ccr read error handling case instead of 0, as done elsewhere in this function. Fixes: 3265f4218878 ("net: phy: at803x: add fiber support") Signed-off-by: Wei Yongjun <[email protected]> Reviewed-by: Andrew Lunn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2022-11-21net/mlx5e: Fix possible race condition in macsec extended packet number ↵Emeel Hakim1-0/+3
update routine Currenty extended packet number (EPN) update routine is accessing macsec object without holding the general macsec lock hence facing a possible race condition when an EPN update occurs while updating or deleting the SA. Fix by holding the general macsec lock before accessing the object. Fixes: 4411a6c0abd3 ("net/mlx5e: Support MACsec offload extended packet number (EPN)") Signed-off-by: Emeel Hakim <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2022-11-21net/mlx5e: Fix MACsec update SecYEmeel Hakim1-1/+1
Currently updating SecY destroys and re-creates RX SA objects, the re-created RX SA objects are not identical to the destroyed objects and it disagree on the encryption enabled property which holds the value false after recreation, this value is not supported with offload which leads to no traffic after an update. Fix by recreating an identical objects. Fixes: 5a39816a75e5 ("net/mlx5e: Add MACsec offload SecY support") Signed-off-by: Emeel Hakim <[email protected]> Reviewed-by: Raed Salem <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2022-11-21net/mlx5e: Fix MACsec SA initialization routineEmeel Hakim1-7/+7
Currently as part of MACsec SA initialization routine extended packet number (EPN) object attribute is always being set without checking if EPN is actually enabled, the above could lead to a NULL dereference. Fix by adding such a check. Fixes: 4411a6c0abd3 ("net/mlx5e: Support MACsec offload extended packet number (EPN)") Signed-off-by: Emeel Hakim <[email protected]> Reviewed-by: Raed Salem <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2022-11-21net/mlx5e: Remove leftovers from old XSK queues enumerationTariq Toukan1-18/+0
Before the cited commit, for N channels, a dedicated set of N queues was created to support XSK, in indices [N, 2N-1], doubling the number of queues. In addition, changing the number of channels was prohibited, as it would shift the indices. Remove these two leftovers, as we moved XSK to a new queueing scheme, starting from index 0. Fixes: 3db4c85cde7a ("net/mlx5e: xsk: Use queue indices starting from 0 for XSK queues") Signed-off-by: Tariq Toukan <[email protected]> Reviewed-by: Gal Pressman <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2022-11-21net/mlx5e: Offload rule only when all encaps are validChris Mi3-17/+9
The cited commit adds a for loop to support multiple encapsulations. But it only checks if the last encap is valid. Fix it by setting slow path flag when one of the encap is invalid. Fixes: f493f15534ec ("net/mlx5e: Move flow attr reformat action bit to per dest flags") Signed-off-by: Chris Mi <[email protected]> Reviewed-by: Roi Dayan <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2022-11-21net/mlx5e: Fix missing alignment in size of MTT/KLM entriesTariq Toukan1-2/+3
In the cited patch, an alignment required by the HW spec was mistakenly dropped. Bring it back to fix error completions like the below: mlx5_core 0000:00:08.0 eth2: Error cqe on cqn 0x40b, ci 0x0, qn 0x104f, opcode 0xd, syndrome 0x2, vendor syndrome 0x68 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000030: 00 00 00 00 86 00 68 02 25 00 10 4f 00 00 bb d2 WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0x0, len: 192 00000000: 00 00 00 25 00 10 4f 0c 00 00 00 00 00 18 2e 00 00000010: 90 00 00 00 00 02 00 00 00 00 00 00 20 00 00 00 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000080: 08 00 00 00 48 6a 00 02 08 00 00 00 0e 10 00 02 00000090: 08 00 00 00 0c db 00 02 08 00 00 00 0e 82 00 02 000000a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 000000b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Fixes: 9f123f740428 ("net/mlx5e: Improve MTT/KSM alignment") Signed-off-by: Tariq Toukan <[email protected]> Reviewed-by: Gal Pressman <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2022-11-21net/mlx5: Fix sync reset event handler error flowMoshe Shemesh1-2/+7
When sync reset now event handling fails on mlx5_pci_link_toggle() then no reset was done. However, since mlx5_cmd_fast_teardown_hca() was already done, the firmware function is closed and the driver is left without firmware functionality. Fix it by setting device error state and reopen the firmware resources. Reopening is done by the thread that was called for devlink reload fw_activate as it already holds the devlink lock. Fixes: 5ec697446f46 ("net/mlx5: Add support for devlink reload action fw activate") Signed-off-by: Moshe Shemesh <[email protected]> Reviewed-by: Aya Levin <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2022-11-21net/mlx5: E-Switch, Set correctly vport destinationRoi Dayan2-5/+7
The cited commit moved from using reformat_id integer to packet_reformat pointer which introduced the possibility to null pointer dereference. When setting packet reformat flag and pkt_reformat pointer must exists so checking MLX5_ESW_DEST_ENCAP is not enough, we need to make sure the pkt_reformat is valid and check for MLX5_ESW_DEST_ENCAP_VALID. If the dest encap valid flag does not exists then pkt_reformat can be either invalid address or null. Also, to make sure we don't try to access invalid pkt_reformat set it to null when invalidated and invalidate it before calling add flow code as its logically more correct and to be safe. Fixes: 2b688ea5efde ("net/mlx5: Add flow steering actions to fs_cmd shim layer") Signed-off-by: Roi Dayan <[email protected]> Reviewed-by: Chris Mi <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2022-11-21net/mlx5: Lag, avoid lockdep warningsEli Cohen4-40/+78
ldev->lock is used to serialize lag change operations. Since multiport eswtich functionality was added, we now change the mode dynamically. However, acquiring ldev->lock is not allowed as it could possibly lead to a deadlock as reported by the lockdep mechanism. [ 836.154963] WARNING: possible circular locking dependency detected [ 836.155850] 5.19.0-rc5_net_56b7df2 #1 Not tainted [ 836.156549] ------------------------------------------------------ [ 836.157418] handler1/12198 is trying to acquire lock: [ 836.158178] ffff888187d52b58 (&ldev->lock){+.+.}-{3:3}, at: mlx5_lag_do_mirred+0x3b/0x70 [mlx5_core] [ 836.159575] [ 836.159575] but task is already holding lock: [ 836.160474] ffff8881d4de2930 (&block->cb_lock){++++}-{3:3}, at: tc_setup_cb_add+0x5b/0x200 [ 836.161669] which lock already depends on the new lock. [ 836.162905] [ 836.162905] the existing dependency chain (in reverse order) is: [ 836.164008] -> #3 (&block->cb_lock){++++}-{3:3}: [ 836.164946] down_write+0x25/0x60 [ 836.165548] tcf_block_get_ext+0x1c6/0x5d0 [ 836.166253] ingress_init+0x74/0xa0 [sch_ingress] [ 836.167028] qdisc_create.constprop.0+0x130/0x5e0 [ 836.167805] tc_modify_qdisc+0x481/0x9f0 [ 836.168490] rtnetlink_rcv_msg+0x16e/0x5a0 [ 836.169189] netlink_rcv_skb+0x4e/0xf0 [ 836.169861] netlink_unicast+0x190/0x250 [ 836.170543] netlink_sendmsg+0x243/0x4b0 [ 836.171226] sock_sendmsg+0x33/0x40 [ 836.171860] ____sys_sendmsg+0x1d1/0x1f0 [ 836.172535] ___sys_sendmsg+0xab/0xf0 [ 836.173183] __sys_sendmsg+0x51/0x90 [ 836.173836] do_syscall_64+0x3d/0x90 [ 836.174471] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 836.175282] [ 836.175282] -> #2 (rtnl_mutex){+.+.}-{3:3}: [ 836.176190] __mutex_lock+0x6b/0xf80 [ 836.176830] register_netdevice_notifier+0x21/0x120 [ 836.177631] rtnetlink_init+0x2d/0x1e9 [ 836.178289] netlink_proto_init+0x163/0x179 [ 836.178994] do_one_initcall+0x63/0x300 [ 836.179672] kernel_init_freeable+0x2cb/0x31b [ 836.180403] kernel_init+0x17/0x140 [ 836.181035] ret_from_fork+0x1f/0x30 [ 836.181687] -> #1 (pernet_ops_rwsem){+.+.}-{3:3}: [ 836.182628] down_write+0x25/0x60 [ 836.183235] unregister_netdevice_notifier+0x1c/0xb0 [ 836.184029] mlx5_ib_roce_cleanup+0x94/0x120 [mlx5_ib] [ 836.184855] __mlx5_ib_remove+0x35/0x60 [mlx5_ib] [ 836.185637] mlx5_eswitch_unregister_vport_reps+0x22f/0x440 [mlx5_core] [ 836.186698] auxiliary_bus_remove+0x18/0x30 [ 836.187409] device_release_driver_internal+0x1f6/0x270 [ 836.188253] bus_remove_device+0xef/0x160 [ 836.188939] device_del+0x18b/0x3f0 [ 836.189562] mlx5_rescan_drivers_locked+0xd6/0x2d0 [mlx5_core] [ 836.190516] mlx5_lag_remove_devices+0x69/0xe0 [mlx5_core] [ 836.191414] mlx5_do_bond_work+0x441/0x620 [mlx5_core] [ 836.192278] process_one_work+0x25c/0x590 [ 836.192963] worker_thread+0x4f/0x3d0 [ 836.193609] kthread+0xcb/0xf0 [ 836.194189] ret_from_fork+0x1f/0x30 [ 836.194826] -> #0 (&ldev->lock){+.+.}-{3:3}: [ 836.195734] __lock_acquire+0x15b8/0x2a10 [ 836.196426] lock_acquire+0xce/0x2d0 [ 836.197057] __mutex_lock+0x6b/0xf80 [ 836.197708] mlx5_lag_do_mirred+0x3b/0x70 [mlx5_core] [ 836.198575] tc_act_parse_mirred+0x25b/0x800 [mlx5_core] [ 836.199467] parse_tc_actions+0x168/0x5a0 [mlx5_core] [ 836.200340] __mlx5e_add_fdb_flow+0x263/0x480 [mlx5_core] [ 836.201241] mlx5e_configure_flower+0x8a0/0x1820 [mlx5_core] [ 836.202187] tc_setup_cb_add+0xd7/0x200 [ 836.202856] fl_hw_replace_filter+0x14c/0x1f0 [cls_flower] [ 836.203739] fl_change+0xbbe/0x1730 [cls_flower] [ 836.204501] tc_new_tfilter+0x407/0xd90 [ 836.205168] rtnetlink_rcv_msg+0x406/0x5a0 [ 836.205877] netlink_rcv_skb+0x4e/0xf0 [ 836.206535] netlink_unicast+0x190/0x250 [ 836.207217] netlink_sendmsg+0x243/0x4b0 [ 836.207915] sock_sendmsg+0x33/0x40 [ 836.208538] ____sys_sendmsg+0x1d1/0x1f0 [ 836.209219] ___sys_sendmsg+0xab/0xf0 [ 836.209878] __sys_sendmsg+0x51/0x90 [ 836.210510] do_syscall_64+0x3d/0x90 [ 836.211137] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 836.211954] other info that might help us debug this: [ 836.213174] Chain exists of: [ 836.213174] &ldev->lock --> rtnl_mutex --> &block->cb_lock 836.214650] Possible unsafe locking scenario: [ 836.214650] [ 836.215574] CPU0 CPU1 [ 836.216255] ---- ---- [ 836.216943] lock(&block->cb_lock); [ 836.217518] lock(rtnl_mutex); [ 836.218348] lock(&block->cb_lock); [ 836.219212] lock(&ldev->lock); [ 836.219758] [ 836.219758] *** DEADLOCK *** [ 836.219758] [ 836.220747] 2 locks held by handler1/12198: [ 836.221390] #0: ffff8881d4de2930 (&block->cb_lock){++++}-{3:3}, at: tc_setup_cb_add+0x5b/0x200 [ 836.222646] #1: ffff88810c9a92c0 (&esw->mode_lock){++++}-{3:3}, at: mlx5_esw_hold+0x39/0x50 [mlx5_core] [ 836.224063] stack backtrace: [ 836.224799] CPU: 6 PID: 12198 Comm: handler1 Not tainted 5.19.0-rc5_net_56b7df2 #1 [ 836.225923] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 836.227476] Call Trace: [ 836.227929] <TASK> [ 836.228332] dump_stack_lvl+0x57/0x7d [ 836.228924] check_noncircular+0x104/0x120 [ 836.229562] __lock_acquire+0x15b8/0x2a10 [ 836.230201] lock_acquire+0xce/0x2d0 [ 836.230776] ? mlx5_lag_do_mirred+0x3b/0x70 [mlx5_core] [ 836.231614] ? find_held_lock+0x2b/0x80 [ 836.232221] __mutex_lock+0x6b/0xf80 [ 836.232799] ? mlx5_lag_do_mirred+0x3b/0x70 [mlx5_core] [ 836.233636] ? mlx5_lag_do_mirred+0x3b/0x70 [mlx5_core] [ 836.234451] ? xa_load+0xc3/0x190 [ 836.234995] mlx5_lag_do_mirred+0x3b/0x70 [mlx5_core] [ 836.235803] tc_act_parse_mirred+0x25b/0x800 [mlx5_core] [ 836.236636] ? tc_act_can_offload_mirred+0x135/0x210 [mlx5_core] [ 836.237550] parse_tc_actions+0x168/0x5a0 [mlx5_core] [ 836.238364] __mlx5e_add_fdb_flow+0x263/0x480 [mlx5_core] [ 836.239202] mlx5e_configure_flower+0x8a0/0x1820 [mlx5_core] [ 836.240076] ? lock_acquire+0xce/0x2d0 [ 836.240668] ? tc_setup_cb_add+0x5b/0x200 [ 836.241294] tc_setup_cb_add+0xd7/0x200 [ 836.241917] fl_hw_replace_filter+0x14c/0x1f0 [cls_flower] [ 836.242709] fl_change+0xbbe/0x1730 [cls_flower] [ 836.243408] tc_new_tfilter+0x407/0xd90 [ 836.244043] ? tc_del_tfilter+0x880/0x880 [ 836.244672] rtnetlink_rcv_msg+0x406/0x5a0 [ 836.245310] ? netlink_deliver_tap+0x7a/0x4b0 [ 836.245991] ? if_nlmsg_stats_size+0x2b0/0x2b0 [ 836.246675] netlink_rcv_skb+0x4e/0xf0 [ 836.258046] netlink_unicast+0x190/0x250 [ 836.258669] netlink_sendmsg+0x243/0x4b0 [ 836.259288] sock_sendmsg+0x33/0x40 [ 836.259857] ____sys_sendmsg+0x1d1/0x1f0 [ 836.260473] ___sys_sendmsg+0xab/0xf0 [ 836.261064] ? lock_acquire+0xce/0x2d0 [ 836.261669] ? find_held_lock+0x2b/0x80 [ 836.262272] ? __fget_files+0xb9/0x190 [ 836.262871] ? __fget_files+0xd3/0x190 [ 836.263462] __sys_sendmsg+0x51/0x90 [ 836.264064] do_syscall_64+0x3d/0x90 [ 836.264652] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 836.265425] RIP: 0033:0x7fdbe5e2677d [ 836.266012] Code: 28 89 54 24 1c 48 89 74 24 10 89 7c 24 08 e8 ba ee ff ff 8b 54 24 1c 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 33 44 89 c7 48 89 44 24 08 e8 ee ee ff ff 48 [ 836.268485] RSP: 002b:00007fdbe48a75a0 EFLAGS: 00000293 ORIG_RAX: 000000000000002e [ 836.269598] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fdbe5e2677d [ 836.270576] RDX: 0000000000000000 RSI: 00007fdbe48a7640 RDI: 000000000000003c [ 836.271565] RBP: 00007fdbe48a8368 R08: 0000000000000000 R09: 0000000000000000 [ 836.272546] R10: 00007fdbe48a84b0 R11: 0000000000000293 R12: 0000557bd17dc860 [ 836.273527] R13: 0000000000000000 R14: 0000557bd17dc860 R15: 00007fdbe48a7640 [ 836.274521] </TASK> To avoid using mode holding ldev->lock in the configure flow, we queue a work to the lag workqueue and cease wait on a completion object. In addition, we remove the lock from mlx5_lag_do_mirred() since it is not really protecting anything. It should be noted that an actual deadlock has not been observed. Signed-off-by: Eli Cohen <[email protected]> Reviewed-by: Mark Bloch <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2022-11-21net/mlx5: Fix handling of entry refcount when command is not issued to FWMoshe Shemesh1-3/+3
In case command interface is down, or the command is not allowed, driver did not increment the entry refcount, but might have decrement as part of forced completion handling. Fix that by always increment and decrement the refcount to make it symmetric for all flows. Fixes: 50b2412b7e78 ("net/mlx5: Avoid possible free of command entry while timeout comp handler") Signed-off-by: Eran Ben Elisha <[email protected]> Signed-off-by: Moshe Shemesh <[email protected]> Reported-by: Jack Wang <[email protected]> Tested-by: Jack Wang <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2022-11-21net/mlx5: cmdif, Print info on any firmware cmd failure to tracepointMoshe Shemesh3-19/+68
While moving to new CMD API (quiet API), some pre-existing flows may call the new API function that in case of error, returns the error instead of printing it as previously done. For such flows we bring back the print but to tracepoint this time for sys admins to have the ability to check for errors especially for commands using the new quiet API. Tracepoint output example: devlink-1333 [001] ..... 822.746922: mlx5_cmd: ACCESS_REG(0x805) op_mod(0x0) failed, status bad resource(0x5), syndrome (0xb06e1f), err(-22) Fixes: f23519e542e5 ("net/mlx5: cmdif, Add new api for command execution") Signed-off-by: Moshe Shemesh <[email protected]> Reviewed-by: Shay Drory <[email protected]> Reviewed-by: Maor Gottlieb <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2022-11-21net/mlx5: SF: Fix probing active SFs during driver probe phaseShay Drory1-0/+88
When SF devices and SF port representors are located on different functions, unloading and reloading of SF parent driver doesn't recreate the existing SF present in the device. Fix it by querying SFs and probe active SFs during driver probe phase. Fixes: 90d010b8634b ("net/mlx5: SF, Add auxiliary device support") Signed-off-by: Shay Drory <[email protected]> Reviewed-by: Parav Pandit <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2022-11-21net/mlx5: Fix FW tracer timestamp calculationMoshe Shemesh1-1/+1
Fix a bug in calculation of FW tracer timestamp. Decreasing one in the calculation should effect only bits 52_7 and not effect bits 6_0 of the timestamp, otherwise bits 6_0 are always set in this calculation. Fixes: 70dd6fdb8987 ("net/mlx5: FW tracer, parse traces and kernel tracing support") Signed-off-by: Moshe Shemesh <[email protected]> Reviewed-by: Feras Daoud <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2022-11-21net/mlx5: Do not query pci info while pci disabledRoy Novich1-3/+6
The driver should not interact with PCI while PCI is disabled. Trying to do so may result in being unable to get vital signs during PCI reset, driver gets timed out and fails to recover. Fixes: fad1783a6d66 ("net/mlx5: Print more info on pci error handlers") Signed-off-by: Roy Novich <[email protected]> Reviewed-by: Moshe Shemesh <[email protected]> Reviewed-by: Aya Levin <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2022-11-21drm/amd/amdgpu: reserve vm invalidation engine for firmwareJack Xiao1-0/+6
If mes enabled, reserve VM invalidation engine 5 for firmware. Signed-off-by: Jack Xiao <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected] # 6.0.x
2022-11-21drm/amdgpu: Enable Aldebaran devices to report CU OccupancyRamesh Errabolu1-0/+1
Allow user to know number of compute units (CU) that are in use at any given moment. Enable access to the method kgd_gfx_v9_get_cu_occupancy that computes CU occupancy. Signed-off-by: Ramesh Errabolu <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected]
2022-11-21drm/amdgpu: fix userptr HMM range handling v2Christian König7-51/+46
The basic problem here is that it's not allowed to page fault while holding the reservation lock. So it can happen that multiple processes try to validate an userptr at the same time. Work around that by putting the HMM range object into the mutex protected bo list for now. v2: make sure range is set to NULL in case of an error Signed-off-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> CC: [email protected] Signed-off-by: Alex Deucher <[email protected]>
2022-11-21drm/amdgpu: always register an MMU notifier for userptrChristian König1-5/+3
Since switching to HMM we always need that because we no longer grab references to the pages. Signed-off-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Acked-by: Felix Kuehling <[email protected]> CC: [email protected] Signed-off-by: Alex Deucher <[email protected]>
2022-11-21drm/amdgpu/dm/mst: Fix uninitialized var in ↵Lyude Paul1-1/+1
pre_compute_mst_dsc_configs_for_state() Coverity noticed this one, so let's fix it. Fixes: ba891436c2d2b2 ("drm/amdgpu/mst: Stop ignoring error codes and deadlocking") Signed-off-by: Lyude Paul <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Reviewed-by: Harry Wentland <[email protected]> Cc: [email protected] # v5.6+
2022-11-21drm/amdgpu/dm/dp_mst: Don't grab mst_mgr->lock when computing DSC stateLyude Paul1-4/+0
Now that we've fixed the issue with using the incorrect topology manager, we're actually grabbing the topology manager's lock - and consequently deadlocking. Luckily for us though, there's actually nothing in AMD's DSC state computation code that really should need this lock. The one exception is the mutex_lock() in dm_dp_mst_is_port_support_mode(), however we grab no locks beneath &mgr->lock there so that should be fine to leave be. Gitlab issue: https://gitlab.freedesktop.org/drm/amd/-/issues/2171 Signed-off-by: Lyude Paul <[email protected]> Fixes: 8c20a1ed9b4f ("drm/amd/display: MST DSC compute fair share") Cc: <[email protected]> # v5.6+ Reviewed-by: Wayne Lin <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-21drm/amdgpu/dm/mst: Use the correct topology mgr pointer in amdgpu_dm_connectorLyude Paul1-19/+18
This bug hurt me. Basically, it appears that we've been grabbing the entirely wrong mutex in the MST DSC computation code for amdgpu! While we've been grabbing: amdgpu_dm_connector->mst_mgr That's zero-initialized memory, because the only connectors we'll ever actually be doing DSC computations for are MST ports. Which have mst_mgr zero-initialized, and instead have the correct topology mgr pointer located at: amdgpu_dm_connector->mst_port->mgr; I'm a bit impressed that until now, this code has managed not to crash anyone's systems! It does seem to cause a warning in LOCKDEP though: [ 66.637670] DEBUG_LOCKS_WARN_ON(lock->magic != lock) This was causing the problems that appeared to have been introduced by: commit 4d07b0bc4034 ("drm/display/dp_mst: Move all payload info into the atomic state") This wasn't actually where they came from though. Presumably, before the only thing we were doing with the topology mgr pointer was attempting to grab mst_mgr->lock. Since the above commit however, we grab much more information from mst_mgr including the atomic MST state and respective modesetting locks. This patch also implies that up until now, it's quite likely we could be susceptible to race conditions when going through the MST topology state for DSC computations since we technically will not have grabbed any lock when going through it. So, let's fix this by adjusting all the respective code paths to look at the right pointer and skip things that aren't actual MST connectors from a topology. Gitlab issue: https://gitlab.freedesktop.org/drm/amd/-/issues/2171 Signed-off-by: Lyude Paul <[email protected]> Fixes: 8c20a1ed9b4f ("drm/amd/display: MST DSC compute fair share") Cc: <[email protected]> # v5.6+ Reviewed-by: Wayne Lin <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-21drm/display/dp_mst: Fix drm_dp_mst_add_affected_dsc_crtcs() return codeLyude Paul1-1/+1
Looks like that we're accidentally dropping a pretty important return code here. For some reason, we just return -EINVAL if we fail to get the MST topology state. This is wrong: error codes are important and should never be squashed without being handled, which here seems to have the potential to cause a deadlock. Signed-off-by: Lyude Paul <[email protected]> Reviewed-by: Wayne Lin <[email protected]> Fixes: 8ec046716ca8 ("drm/dp_mst: Add helper to trigger modeset on affected DSC MST CRTCs") Cc: <[email protected]> # v5.6+ Signed-off-by: Alex Deucher <[email protected]>
2022-11-21drm/amdgpu/mst: Stop ignoring error codes and deadlockingLyude Paul3-118/+147
It appears that amdgpu makes the mistake of completely ignoring the return values from the DP MST helpers, and instead just returns a simple true/false. In this case, it seems to have come back to bite us because as a result of simply returning false from compute_mst_dsc_configs_for_state(), amdgpu had no way of telling when a deadlock happened from these helpers. This could definitely result in some kernel splats. V2: * Address Wayne's comments (fix another bunch of spots where we weren't passing down return codes) Signed-off-by: Lyude Paul <[email protected]> Fixes: 8c20a1ed9b4f ("drm/amd/display: MST DSC compute fair share") Cc: Harry Wentland <[email protected]> Cc: <[email protected]> # v5.6+ Reviewed-by: Wayne Lin <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-21drm/amd/display: Align dcn314_smu logging with other DCNsRoman Li1-2/+9
[Why] Assert on non-OK response from SMU is unnecessary. It was replaced with respective log message on other asics in the past with commit: "drm/amd/display: Removing assert statements for Linux" [How] Remove assert and add dbg logging as on other DCNs. Signed-off-by: Roman Li <[email protected]> Reviewed-by: Saaem Rizvi <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-21cifs: fix missing unlock in cifs_file_copychunk_range()ChenXiaoSong1-1/+3
xfstests generic/013 and generic/476 reported WARNING as follows: WARNING: lock held when returning to user space! 6.1.0-rc5+ #4 Not tainted ------------------------------------------------ fsstress/504233 is leaving the kernel with locks still held! 2 locks held by fsstress/504233: #0: ffff888054c38850 (&sb->s_type->i_mutex_key#21){+.+.}-{3:3}, at: lock_two_nondirectories+0xcf/0xf0 #1: ffff8880b8fec750 (&sb->s_type->i_mutex_key#21/4){+.+.}-{3:3}, at: lock_two_nondirectories+0xb7/0xf0 This will lead to deadlock and hungtask. Fix this by releasing locks when failed to write out on a file range in cifs_file_copychunk_range(). Fixes: 3e3761f1ec7d ("smb3: use filemap_write_and_wait_range instead of filemap_write_and_wait") Cc: [email protected] # 6.0 Reviewed-by: Paulo Alcantara (SUSE) <[email protected]> Signed-off-by: ChenXiaoSong <[email protected]> Signed-off-by: Steve French <[email protected]>
2022-11-21clocksource/drivers/arm_arch_timer: Fix XGene-1 TVAL register math errorJoe Korty1-2/+5
The TVAL register is 32 bit signed. Thus only the lower 31 bits are available to specify when an interrupt is to occur at some time in the near future. Attempting to specify a larger interval with TVAL results in a negative time delta which means the timer fires immediately upon being programmed, rather than firing at that expected future time. The solution is for Linux to declare that TVAL is a 31 bit register rather than give its true size of 32 bits. This prevents Linux from programming TVAL with a too-large value. Note that, prior to 5.16, this little trick was the standard way to handle TVAL in Linux, so there is nothing new happening here on that front. The softlockup detector hides the issue, because it keeps generating short timer deadlines that are within the scope of the broken timer. Disabling it, it starts using NO_HZ with much longer timer deadlines, which turns into an interrupt flood: 11: 1124855130 949168462 758009394 76417474 104782230 30210281 310890 1734323687 GICv2 29 Level arch_timer And "much longer" isn't that long: it takes less than 43s to underflow TVAL at 50MHz (the frequency of the counter on XGene-1). Some comments on the v1 version of this patch by Marc Zyngier: XGene implements CVAL (a 64bit comparator) in terms of TVAL (a countdown register) instead of the other way around. TVAL being a 32bit register, the width of the counter should equally be 32. However, TVAL is a *signed* value, and keeps counting down in the negative range once the timer fires. It means that any TVAL value with bit 31 set will fire immediately, as it cannot be distinguished from an already expired timer. Reducing the timer range back to a paltry 31 bits papers over the issue. Another problem cannot be fixed though, which is that the timer interrupt *must* be handled within the negative countdown period, or the interrupt will be lost (TVAL will rollover to a positive value, indicative of a new timer deadline). Fixes: 012f18850452 ("clocksource/drivers/arm_arch_timer: Work around broken CVAL implementations") Signed-off-by: Joe Korty <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/r/[email protected] [maz: revamped the commit message]
2022-11-21Merge tag 'am335x-pcm-953-regulators' of ↵Arnd Bergmann1-15/+13
git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap into arm/fixes Regulator changes for am335x-pcm-953 This is for deferred probe issue on am335x-pcm-953 sdhci-omap regulator. * tag 'am335x-pcm-953-regulators' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap: ARM: dts: am335x-pcm-953: Define fixed regulators in root node Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Arnd Bergmann <[email protected]>
2022-11-21netfilter: ipset: regression in ip_set_hash_ip.cVishwanath Pai1-5/+3
This patch introduced a regression: commit 48596a8ddc46 ("netfilter: ipset: Fix adding an IPv4 range containing more than 2^31 addresses") The variable e.ip is passed to adtfn() function which finally adds the ip address to the set. The patch above refactored the for loop and moved e.ip = htonl(ip) to the end of the for loop. What this means is that if the value of "ip" changes between the first assignement of e.ip and the forloop, then e.ip is pointing to a different ip address than "ip". Test case: $ ipset create jdtest_tmp hash:ip family inet hashsize 2048 maxelem 100000 $ ipset add jdtest_tmp 10.0.1.1/31 ipset v6.21.1: Element cannot be added to the set: it's already added The value of ip gets updated inside the "else if (tb[IPSET_ATTR_CIDR])" block but e.ip is still pointing to the old value. Fixes: 48596a8ddc46 ("netfilter: ipset: Fix adding an IPv4 range containing more than 2^31 addresses") Reviewed-by: Joshua Hunt <[email protected]> Signed-off-by: Vishwanath Pai <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2022-11-21btrfs: qgroup: fix sleep from invalid context bug in btrfs_qgroup_inherit()ChenXiaoSong1-8/+1
Syzkaller reported BUG as follows: BUG: sleeping function called from invalid context at include/linux/sched/mm.h:274 Call Trace: <TASK> dump_stack_lvl+0xcd/0x134 __might_resched.cold+0x222/0x26b kmem_cache_alloc+0x2e7/0x3c0 update_qgroup_limit_item+0xe1/0x390 btrfs_qgroup_inherit+0x147b/0x1ee0 create_subvol+0x4eb/0x1710 btrfs_mksubvol+0xfe5/0x13f0 __btrfs_ioctl_snap_create+0x2b0/0x430 btrfs_ioctl_snap_create_v2+0x25a/0x520 btrfs_ioctl+0x2a1c/0x5ce0 __x64_sys_ioctl+0x193/0x200 do_syscall_64+0x35/0x80 Fix this by calling qgroup_dirty() on @dstqgroup, and update limit item in btrfs_run_qgroups() later outside of the spinlock context. CC: [email protected] # 4.9+ Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: ChenXiaoSong <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-11-21btrfs: send: avoid unaligned encoded writes when attempting to clone rangeFilipe Manana1-1/+23
When trying to see if we can clone a file range, there are cases where we end up sending two write operations in case the inode from the source root has an i_size that is not sector size aligned and the length from the current offset to its i_size is less than the remaining length we are trying to clone. Issuing two write operations when we could instead issue a single write operation is not incorrect. However it is not optimal, specially if the extents are compressed and the flag BTRFS_SEND_FLAG_COMPRESSED was passed to the send ioctl. In that case we can end up sending an encoded write with an offset that is not sector size aligned, which makes the receiver fallback to decompressing the data and writing it using regular buffered IO (so re-compressing the data in case the fs is mounted with compression enabled), because encoded writes fail with -EINVAL when an offset is not sector size aligned. The following example, which triggered a bug in the receiver code for the fallback logic of decompressing + regular buffer IO and is fixed by the patchset referred in a Link at the bottom of this changelog, is an example where we have the non-optimal behaviour due to an unaligned encoded write: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj mkfs.btrfs -f $DEV > /dev/null mount -o compress $DEV $MNT # File foo has a size of 33K, not aligned to the sector size. xfs_io -f -c "pwrite -S 0xab 0 33K" $MNT/foo xfs_io -f -c "pwrite -S 0xcd 0 64K" $MNT/bar # Now clone the first 32K of file bar into foo at offset 0. xfs_io -c "reflink $MNT/bar 0 0 32K" $MNT/foo # Snapshot the default subvolume and create a full send stream (v2). btrfs subvolume snapshot -r $MNT $MNT/snap btrfs send --compressed-data -f /tmp/test.send $MNT/snap echo -e "\nFile bar in the original filesystem:" od -A d -t x1 $MNT/snap/bar umount $MNT mkfs.btrfs -f $DEV > /dev/null mount $DEV $MNT echo -e "\nReceiving stream in a new filesystem..." btrfs receive -f /tmp/test.send $MNT echo -e "\nFile bar in the new filesystem:" od -A d -t x1 $MNT/snap/bar umount $MNT Before this patch, the send stream included one regular write and one encoded write for file 'bar', with the later being not sector size aligned and causing the receiver to fallback to decompression + buffered writes. The output of the btrfs receive command in verbose mode (-vvv): (...) mkfile o258-7-0 rename o258-7-0 -> bar utimes clone bar - source=foo source offset=0 offset=0 length=32768 write bar - offset=32768 length=1024 encoded_write bar - offset=33792, len=4096, unencoded_offset=33792, unencoded_file_len=31744, unencoded_len=65536, compression=1, encryption=0 encoded_write bar - falling back to decompress and write due to errno 22 ("Invalid argument") (...) This patch avoids the regular write followed by an unaligned encoded write so that we end up sending a single encoded write that is aligned. So after this patch the stream content is (output of btrfs receive -vvv): (...) mkfile o258-7-0 rename o258-7-0 -> bar utimes clone bar - source=foo source offset=0 offset=0 length=32768 encoded_write bar - offset=32768, len=4096, unencoded_offset=32768, unencoded_file_len=32768, unencoded_len=65536, compression=1, encryption=0 (...) So we get more optimal behaviour and avoid the silent data loss bug in versions of btrfs-progs affected by the bug referred by the Link tag below (btrfs-progs v5.19, v5.19.1, v6.0 and v6.0.1). Link: https://lore.kernel.org/linux-btrfs/[email protected]/ Reviewed-by: Boris Burkov <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-11-21btrfs: zoned: fix missing endianness conversion in sb_write_pointerChristoph Hellwig1-1/+2
generation is an on-disk __le64 value, so use btrfs_super_generation to convert it to host endian before comparing it. Fixes: 12659251ca5d ("btrfs: implement log-structured superblock for ZONED mode") CC: [email protected] # 5.15+ Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-11-21x86/pm: Add enumeration check before spec MSRs save/restore setupPawan Gupta1-8/+15
pm_save_spec_msr() keeps a list of all the MSRs which _might_ need to be saved and restored at hibernate and resume. However, it has zero awareness of CPU support for these MSRs. It mostly works by unconditionally attempting to manipulate these MSRs and relying on rdmsrl_safe() being able to handle a #GP on CPUs where the support is unavailable. However, it's possible for reads (RDMSR) to be supported for a given MSR while writes (WRMSR) are not. In this case, msr_build_context() sees a successful read (RDMSR) and marks the MSR as valid. Then, later, a write (WRMSR) fails, producing a nasty (but harmless) error message. This causes restore_processor_state() to try and restore it, but writing this MSR is not allowed on the Intel Atom N2600 leading to: unchecked MSR access error: WRMSR to 0x122 (tried to write 0x0000000000000002) \ at rIP: 0xffffffff8b07a574 (native_write_msr+0x4/0x20) Call Trace: <TASK> restore_processor_state x86_acpi_suspend_lowlevel acpi_suspend_enter suspend_devices_and_enter pm_suspend.cold state_store kernfs_fop_write_iter vfs_write ksys_write do_syscall_64 ? do_syscall_64 ? up_read ? lock_is_held_type ? asm_exc_page_fault ? lockdep_hardirqs_on entry_SYSCALL_64_after_hwframe To fix this, add the corresponding X86_FEATURE bit for each MSR. Avoid trying to manipulate the MSR when the feature bit is clear. This required adding a X86_FEATURE bit for MSRs that do not have one already, but it's a small price to pay. [ bp: Move struct msr_enumeration inside the only function that uses it. ] Fixes: 73924ec4d560 ("x86/pm: Save the MSR validity status at context setup") Reported-by: Hans de Goede <[email protected]> Signed-off-by: Pawan Gupta <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Acked-by: Rafael J. Wysocki <[email protected]> Cc: <[email protected]> Link: https://lore.kernel.org/r/c24db75d69df6e66c0465e13676ad3f2837a2ed8.1668539735.git.pawan.kumar.gupta@linux.intel.com
2022-11-21x86/tsx: Add a feature bit for TSX control MSR supportPawan Gupta2-21/+20
Support for the TSX control MSR is enumerated in MSR_IA32_ARCH_CAPABILITIES. This is different from how other CPU features are enumerated i.e. via CPUID. Currently, a call to tsx_ctrl_is_supported() is required for enumerating the feature. In the absence of a feature bit for TSX control, any code that relies on checking feature bits directly will not work. In preparation for adding a feature bit check in MSR save/restore during suspend/resume, set a new feature bit X86_FEATURE_TSX_CTRL when MSR_IA32_TSX_CTRL is present. Also make tsx_ctrl_is_supported() use the new feature bit to avoid any overhead of reading the MSR. [ bp: Remove tsx_ctrl_is_supported(), add room for two more feature bits in word 11 which are coming up in the next merge window. ] Suggested-by: Andrew Cooper <[email protected]> Signed-off-by: Pawan Gupta <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Cc: <[email protected]> Link: https://lore.kernel.org/r/de619764e1d98afbb7a5fa58424f1278ede37b45.1668539735.git.pawan.kumar.gupta@linux.intel.com
2022-11-21octeontx2-af: cn10k: mcs: Fix copy and paste bug in mcs_bbe_intr_handler()Dan Carpenter1-1/+1
This code accidentally uses the RX macro twice instead of the RX and TX. Fixes: 6c635f78c474 ("octeontx2-af: cn10k: mcs: Handle MCS block interrupts") Signed-off-by: Dan Carpenter <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2022-11-21ipv4/fib: Replace zero-length array with DECLARE_FLEX_ARRAY() helperKees Cook1-1/+1
Zero-length arrays are deprecated[1] and are being replaced with flexible array members in support of the ongoing efforts to tighten the FORTIFY_SOURCE routines on memcpy(), correctly instrument array indexing with UBSAN_BOUNDS, and to globally enable -fstrict-flex-arrays=3. Replace zero-length array with flexible-array member in struct key_vector. This results in no differences in binary output. [1] https://github.com/KSPP/linux/issues/78 Cc: Jakub Kicinski <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Hideaki YOSHIFUJI <[email protected]> Cc: David Ahern <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Paolo Abeni <[email protected]> Cc: "Gustavo A. R. Silva" <[email protected]> Cc: [email protected] Signed-off-by: Kees Cook <[email protected]> Reviewed-by: Gustavo A. R. Silva <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2022-11-21selftests/net: Find nettest in current directoryDaniel Díaz2-8/+13
The `nettest` binary, built from `selftests/net/nettest.c`, was expected to be found in the path during test execution of `fcnal-test.sh` and `pmtu.sh`, leading to tests getting skipped when the binary is not installed in the system, as can be seen in these logs found in the wild [1]: # TEST: vti4: PMTU exceptions [SKIP] [ 350.600250] IPv6: ADDRCONF(NETDEV_CHANGE): veth_b: link becomes ready [ 350.607421] IPv6: ADDRCONF(NETDEV_CHANGE): veth_a: link becomes ready # 'nettest' command not found; skipping tests # xfrm6udp not supported # TEST: vti6: PMTU exceptions (ESP-in-UDP) [SKIP] [ 351.605102] IPv6: ADDRCONF(NETDEV_CHANGE): veth_b: link becomes ready [ 351.612243] IPv6: ADDRCONF(NETDEV_CHANGE): veth_a: link becomes ready # 'nettest' command not found; skipping tests # xfrm4udp not supported The `unicast_extensions.sh` tests also rely on `nettest`, but it runs fine there because it looks for the binary in the current working directory [2]: The same mechanism that works for the Unicast extensions tests is here copied over to the PMTU and functional tests. [1] https://lkft.validation.linaro.org/scheduler/job/5839508#L6221 [2] https://lkft.validation.linaro.org/scheduler/job/5839508#L7958 Signed-off-by: Daniel Díaz <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2022-11-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nfDavid S. Miller12-41/+49
Pablo Neira Ayuso says: ==================== The following patchset contains late Netfilter fixes for net: 1) Use READ_ONCE()/WRITE_ONCE() to update ct->mark, from Daniel Xu. Not reported by syzbot, but I presume KASAN would trigger post a splat on this. This is a rather old issue, predating git history. 2) Do not set up extensions for set element with end interval flag set on. This leads to bogusly skipping this elements as expired when listing the set/map to userspace as well as increasing memory consumpton when stateful expressions are used. This issue has been present since 4.18, when timeout support for rbtree set was added. ==================== Signed-off-by: David S. Miller <[email protected]>