aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2021-11-16selftests: add a test case for mirred egress to ingressDavide Caratti2-1/+47
add a selftest that verifies the correct behavior of TC act_mirred egress to ingress: in particular, it checks if the dst_entry is removed from skb before redirect egress -> ingress. The correct behavior is: an ICMP 'echo request' generated by ping will be received and generate a reply the same way as the one generated by mausezahn. Suggested-by: Jamal Hadi Salim <[email protected]> Signed-off-by: Davide Caratti <[email protected]> Acked-by: Cong Wang <[email protected]> Reviewed-by: Marcelo Ricardo Leitner <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2021-11-16net: sched: act_mirred: drop dst for the direction from egress to ingressXin Long1-3/+8
Without dropping dst, the packets sent from local mirred/redirected to ingress will may still use the old dst. ip_rcv() will drop it as the old dst is for output and its .input is dst_discard. This patch is to fix by also dropping dst for those packets that are mirred or redirected from egress to ingress in act_mirred. Note that we don't drop it for the direction change from ingress to egress, as on which there might be a user case attaching a metadata dst by act_tunnel_key that would be used later. Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct") Signed-off-by: Xin Long <[email protected]> Acked-by: Cong Wang <[email protected]> Reviewed-by: Marcelo Ricardo Leitner <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2021-11-16amt: cancel delayed_work synchronously in amt_fini()Taehee Yoo1-1/+1
When the amt module is being removed, it calls cancel_delayed_work() to cancel pending delayed_work. But this function doesn't wait for canceling delayed_work. So, workers can be still doing after module delete. In order to avoid this, cancel_delayed_work_sync() should be used instead. Suggested-by: Jakub Kicinski <[email protected]> Fixes: bc54e49c140b ("amt: add multicast(IGMP) report message handler") Signed-off-by: Taehee Yoo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2021-11-16MAINTAINERS: remove [email protected]Pavel Skripkin1-2/+2
I've sent a patch to [email protected] few days ago and got a reply from [email protected]: Delivery has failed to these recipients or groups: [email protected]<mailto:[email protected]> The email address you entered couldn't be found. Please check the recipient's email address and try to resend the message. If the problem continues, please contact your helpdesk. As requested by Alok Prasad, replacing [email protected] with Manish Chopra's email address. [0] Link: https://lore.kernel.org/all/[email protected]/ [0] Signed-off-by: Pavel Skripkin <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2021-11-16bnxt_en: Fix compile error regression when CONFIG_BNXT_SRIOV is not setMichael Chan2-1/+11
bp->sriov_cfg is not defined when CONFIG_BNXT_SRIOV is not set. Fix it by adding a helper function bnxt_sriov_cfg() to handle the logic with or without the config option. Fixes: 46d08f55d24e ("bnxt_en: extend RTNL to VF check in devlink driver_reinit") Reported-by: kernel test robot <[email protected]> Reviewed-by: Edwin Peer <[email protected]> Reviewed-by: Andy Gospodarek <[email protected]> Signed-off-by: Michael Chan <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2021-11-16net: mvmdio: fix compilation warningMarcin Wojtas1-0/+2
The kernel test robot reported a following issue: >> drivers/net/ethernet/marvell/mvmdio.c:426:36: warning: unused variable 'orion_mdio_acpi_match' [-Wunused-const-variable] static const struct acpi_device_id orion_mdio_acpi_match[] = { ^ 1 warning generated. Fix that by surrounding the variable by appropriate ifdef. Fixes: c54da4c1acb1 ("net: mvmdio: add ACPI support") Reported-by: kernel test robot <[email protected]> Signed-off-by: Marcin Wojtas <[email protected]> Reviewed-by: Andrew Lunn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2021-11-16Merge tag 'mac80211-for-net-2021-11-16' of ↵Jakub Kicinski10-58/+63
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Couple of fixes: * bad dont-reorder check * throughput LED trigger for various new(ish) paths * radiotap header generation * locking assertions in mac80211 with monitor mode * radio statistics * don't try to access IV when not present * call stop_ap for P2P_GO as well as we should * tag 'mac80211-for-net-2021-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211: mac80211: fix throughput LED trigger mac80211: fix monitor_sdata RCU/locking assertions mac80211: drop check for DONT_REORDER in __ieee80211_select_queue mac80211: fix radiotap header generation mac80211: do not access the IV when it was stripped nl80211: fix radio statistics in survey dump cfg80211: call cfg80211_stop_ap when switch from P2P_GO type ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2021-11-16Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski23-73/+573
Daniel Borkmann says: ==================== pull-request: bpf 2021-11-16 We've added 12 non-merge commits during the last 5 day(s) which contain a total of 23 files changed, 573 insertions(+), 73 deletions(-). The main changes are: 1) Fix pruning regression where verifier went overly conservative rejecting previsouly accepted programs, from Alexei Starovoitov and Lorenz Bauer. 2) Fix verifier TOCTOU bug when using read-only map's values as constant scalars during verification, from Daniel Borkmann. 3) Fix a crash due to a double free in XSK's buffer pool, from Magnus Karlsson. 4) Fix libbpf regression when cross-building runqslower, from Jean-Philippe Brucker. 5) Forbid use of bpf_ktime_get_coarse_ns() and bpf_timer_*() helpers in tracing programs due to deadlock possibilities, from Dmitrii Banshchikov. 6) Fix checksum validation in sockmap's udp_read_sock() callback, from Cong Wang. 7) Various BPF sample fixes such as XDP stats in xdp_sample_user, from Alexander Lobakin. 8) Fix libbpf gen_loader error handling wrt fd cleanup, from Kumar Kartikeya Dwivedi. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: udp: Validate checksum in udp_read_sock() bpf: Fix toctou on read-only map's constant scalar tracking samples/bpf: Fix build error due to -isystem removal selftests/bpf: Add tests for restricted helpers bpf: Forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing progs libbpf: Perform map fd cleanup for gen_loader in case of error samples/bpf: Fix incorrect use of strlen in xdp_redirect_cpu tools/runqslower: Fix cross-build samples/bpf: Fix summary per-sec stats in xdp_sample_user selftests/bpf: Check map in map pruning bpf: Fix inner map state pruning regression. xsk: Fix crash on double free in buffer pool ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2021-11-16scsi: qla2xxx: Fix mailbox direction flags in qla2xxx_get_adapter_id()Ewan D. Milne1-4/+2
The SCM changes set the flags in mcp->out_mb instead of mcp->in_mb so the data was not actually being read into the mcp->mb[] array from the adapter. Link: https://lore.kernel.org/r/[email protected] Fixes: 9f2475fe7406 ("scsi: qla2xxx: SAN congestion management implementation") Cc: [email protected] Reviewed-by: Himanshu Madhani <[email protected]> Reviewed-by: Arun Easi <[email protected]> Signed-off-by: Ewan D. Milne <[email protected]> Signed-off-by: Martin K. Petersen <[email protected]>
2021-11-16scsi: ufs: core: Fix another task management completion raceAdrian Hunter1-2/+1
hba->outstanding_tasks, which is read under host_lock spinlock, tells the interrupt handler what task management tags are in use by the driver. The doorbell register bits indicate which tags are in use by the hardware. A doorbell bit that is 0 is because the bit has yet to be set by the driver, or because the task is complete. It is only possible to disambiguate the 2 cases, if reading/writing the doorbell register is synchronized with reading/writing hba->outstanding_tasks. For that reason, reading REG_UTP_TASK_REQ_DOOR_BELL must be done under spinlock. Link: https://lore.kernel.org/r/[email protected] Fixes: f5ef336fd2e4 ("scsi: ufs: core: Fix task management completion") Reviewed-by: Bart Van Assche <[email protected]> Signed-off-by: Adrian Hunter <[email protected]> Signed-off-by: Martin K. Petersen <[email protected]>
2021-11-16scsi: ufs: core: Fix task management completion timeout raceAdrian Hunter1-5/+0
__ufshcd_issue_tm_cmd() clears req->end_io_data after timing out, which races with the completion function ufshcd_tmc_handler() which expects req->end_io_data to have a value. Note __ufshcd_issue_tm_cmd() and ufshcd_tmc_handler() are already synchronized using hba->tmf_rqs and hba->outstanding_tasks under the host_lock spinlock. It is also not necessary (nor typical) to clear req->end_io_data because the block layer does it before allocating out requests e.g. via blk_get_request(). So fix by not clearing it. Link: https://lore.kernel.org/r/[email protected] Fixes: f5ef336fd2e4 ("scsi: ufs: core: Fix task management completion") Reviewed-by: Bart Van Assche <[email protected]> Signed-off-by: Adrian Hunter <[email protected]> Signed-off-by: Martin K. Petersen <[email protected]>
2021-11-16scsi: core: sysfs: Fix hang when device state is set via sysfsMike Christie1-11/+19
This fixes a regression added with: commit f0f82e2476f6 ("scsi: core: Fix capacity set to zero after offlinining device") The problem is that after iSCSI recovery, iscsid will call into the kernel to set the dev's state to running, and with that patch we now call scsi_rescan_device() with the state_mutex held. If the SCSI error handler thread is just starting to test the device in scsi_send_eh_cmnd() then it's going to try to grab the state_mutex. We are then stuck, because when scsi_rescan_device() tries to send its I/O scsi_queue_rq() calls -> scsi_host_queue_ready() -> scsi_host_in_recovery() which will return true (the host state is still in recovery) and I/O will just be requeued. scsi_send_eh_cmnd() will then never be able to grab the state_mutex to finish error handling. To prevent the deadlock move the rescan-related code to after we drop the state_mutex. This also adds a check for if we are already in the running state. This prevents extra scans and helps the iscsid case where if the transport class has already onlined the device during its recovery process then we don't need userspace to do it again plus possibly block that daemon. Link: https://lore.kernel.org/r/[email protected] Fixes: f0f82e2476f6 ("scsi: core: Fix capacity set to zero after offlinining device") Cc: Bart Van Assche <[email protected]> Cc: lijinlin <[email protected]> Cc: Wu Bo <[email protected]> Reviewed-by: Lee Duncan <[email protected]> Reviewed-by: Wu Bo <[email protected]> Signed-off-by: Mike Christie <[email protected]> Signed-off-by: Martin K. Petersen <[email protected]>
2021-11-16scsi: iscsi: Unblock session then wake up error handlerMike Christie1-3/+3
We can race where iscsi_session_recovery_timedout() has woken up the error handler thread and it's now setting the devices to offline, and session_recovery_timedout()'s call to scsi_target_unblock() is also trying to set the device's state to transport-offline. We can then get a mix of states. For the case where we can't relogin we want the devices to be in transport-offline so when we have repaired the connection __iscsi_unblock_session() can set the state back to running. Set the device state then call into libiscsi to wake up the error handler. Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Lee Duncan <[email protected]> Signed-off-by: Mike Christie <[email protected]> Signed-off-by: Martin K. Petersen <[email protected]>
2021-11-16scsi: ufs: core: Improve SCSI abort handlingBart Van Assche1-0/+1
The following has been observed on a test setup: WARNING: CPU: 4 PID: 250 at drivers/scsi/ufs/ufshcd.c:2737 ufshcd_queuecommand+0x468/0x65c Call trace: ufshcd_queuecommand+0x468/0x65c scsi_send_eh_cmnd+0x224/0x6a0 scsi_eh_test_devices+0x248/0x418 scsi_eh_ready_devs+0xc34/0xe58 scsi_error_handler+0x204/0x80c kthread+0x150/0x1b4 ret_from_fork+0x10/0x30 That warning is triggered by the following statement: WARN_ON(lrbp->cmd); Fix this warning by clearing lrbp->cmd from the abort handler. Link: https://lore.kernel.org/r/[email protected] Fixes: 7a3e97b0dc4b ("[SCSI] ufshcd: UFS Host controller driver") Reviewed-by: Bean Huo <[email protected]> Reviewed-by: Stanley Chu <[email protected]> Signed-off-by: Bart Van Assche <[email protected]> Signed-off-by: Martin K. Petersen <[email protected]>
2021-11-17ata: libata: improve ata_read_log_page() error messageDamien Le Moal1-2/+3
If ata_read_log_page() fails to read a log page, the ata_dev_err() error message only print the page number, omitting the log number. In case of error, facilitate debugging by also printing the log number. Cc: [email protected] # 5.15 Signed-off-by: Damien Le Moal <[email protected]> Tested-by: Matthew Perkowski <[email protected]>
2021-11-16net/mlx5: E-Switch, return error if encap isn't supportedRaed Salem1-1/+1
On regular ConnectX HCAs getting encap mode isn't supported when the E-Switch is in NONE mode. Current code would return no error code when trying to get encap mode in such case which is wrong. Fix by returning error value to indicate failure to caller in such case. Fixes: 8e0aa4bc959c ("net/mlx5: E-switch, Protect eswitch mode changes") Signed-off-by: Raed Salem <[email protected]> Reviewed-by: Mark Bloch <[email protected]> Reviewed-by: Maor Dickman <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2021-11-16net/mlx5: Lag, update tracker when state change event receivedMaher Sanalla1-15/+13
Currently, In NETDEV_CHANGELOWERSTATE/NETDEV_CHANGEUPPERSTATE events handling, tracking is not fully completed if the LAG device is not ready at the time the events occur. But, we must keep track of the upper and lower states after receiving the events because RoCE needs this info in mlx5_lag_get_roce_netdev() - in order to return the corresponding port that its running on. Returning the wrong (not most recent) port will lead to gids table being incorrect. For example: If during the attachment of a slave to the bond, the other non-attached port performs pci_reload, then the LAG device is not ready, but that should not result in dismissing attached slave tracker update automatically (which is performed in mlx5_handle_changelowerstate()), Since these events might not come later, which can lead to both bond ports having tx_enabled=0 - which is not a valid state of LAG bond. Fixes: 9b412cc35f00 ("net/mlx5e: Add LAG warning if bond slave is not lag master") Signed-off-by: Maher Sanalla <[email protected]> Reviewed-by: Mark Bloch <[email protected]> Reviewed-by: Jianbo Liu <[email protected]> Reviewed-by: Roi Dayan <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2021-11-16net/mlx5e: CT, Fix multiple allocations and memleak of mod actsRoi Dayan3-11/+25
CT clear action offload adds additional mod hdr actions to the flow's original mod actions in order to clear the registers which hold ct_state. When such flow also includes encap action, a neigh update event can cause the driver to unoffload the flow and then reoffload it. Each time this happens, the ct clear handling adds that same set of mod hdr actions to reset ct_state until the max of mod hdr actions is reached. Also the driver never releases the allocated mod hdr actions and causing a memleak. Fix above two issues by moving CT clear mod acts allocation into the parsing actions phase and only use it when offloading the rule. The release of mod acts will be done in the normal flow_put(). backtrace: [<000000007316e2f3>] krealloc+0x83/0xd0 [<00000000ef157de1>] mlx5e_mod_hdr_alloc+0x147/0x300 [mlx5_core] [<00000000970ce4ae>] mlx5e_tc_match_to_reg_set_and_get_id+0xd7/0x240 [mlx5_core] [<0000000067c5fa17>] mlx5e_tc_match_to_reg_set+0xa/0x20 [mlx5_core] [<00000000d032eb98>] mlx5_tc_ct_entry_set_registers.isra.0+0x36/0xc0 [mlx5_core] [<00000000fd23b869>] mlx5_tc_ct_flow_offload+0x272/0x1f10 [mlx5_core] [<000000004fc24acc>] mlx5e_tc_offload_fdb_rules.part.0+0x150/0x620 [mlx5_core] [<00000000dc741c17>] mlx5e_tc_encap_flows_add+0x489/0x690 [mlx5_core] [<00000000e92e49d7>] mlx5e_rep_update_flows+0x6e4/0x9b0 [mlx5_core] [<00000000f60f5602>] mlx5e_rep_neigh_update+0x39a/0x5d0 [mlx5_core] Fixes: 1ef3018f5af3 ("net/mlx5e: CT: Support clear action") Signed-off-by: Roi Dayan <[email protected]> Reviewed-by: Paul Blakey <[email protected]> Reviewed-by: Maor Dickman <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2021-11-16net/mlx5: Fix flow counters SF bulk query lenAvihai Horon1-1/+1
When doing a flow counters bulk query, the number of counters to query must be aligned to 4. Current SF bulk query len is not aligned to 4, which leads to an error when trying to query more than 4 counters. Fix it by aligning SF bulk query len to 4. Fixes: 2fdeb4f4c2ae ("net/mlx5: Reduce flow counters bulk query buffer size for SFs") Signed-off-by: Avihai Horon <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2021-11-16net/mlx5: E-Switch, rebuild lag only when neededMark Bloch1-2/+10
A user can enable VFs without changing E-Switch mode, this can happen when a user moves straight to switchdev mode and only once in switchdev VFs are enabled via the sysfs interface. The cited commit assumed this isn't possible and exposed a single API function where the E-switch calls into the lag code, breaks the lag and prevents any other lag operations to take place until the E-switch update has ended. Breaking the hardware lag when it isn't needed can make it such that hardware lag can't be enabled again. In the sysfs call path check if the current E-Switch mode is NONE, in the context of the function it can only mean the E-Switch is moving out of NONE mode and the hardware lag should be disabled and enabled once the mode change has ended. If the mode isn't NONE it means VFs are about to be enabled and such operation doesn't require toggling the hardware lag. Fixes: cac1eb2cf2e3 ("net/mlx5: Lag, properly lock eswitch if needed") Signed-off-by: Mark Bloch <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2021-11-16net/mlx5: Update error handler for UCTX and UMEMNeta Ostrovsky1-2/+2
In the fast unload flow, the device state is set to internal error, which indicates that the driver started the destroy process. In this case, when a destroy command is being executed, it should return MLX5_CMD_STAT_OK. Fix MLX5_CMD_OP_DESTROY_UCTX and MLX5_CMD_OP_DESTROY_UMEM to return OK instead of EIO. This fixes a call trace in the umem release process - [ 2633.536695] Call Trace: [ 2633.537518] ib_uverbs_remove_one+0xc3/0x140 [ib_uverbs] [ 2633.538596] remove_client_context+0x8b/0xd0 [ib_core] [ 2633.539641] disable_device+0x8c/0x130 [ib_core] [ 2633.540615] __ib_unregister_device+0x35/0xa0 [ib_core] [ 2633.541640] ib_unregister_device+0x21/0x30 [ib_core] [ 2633.542663] __mlx5_ib_remove+0x38/0x90 [mlx5_ib] [ 2633.543640] auxiliary_bus_remove+0x1e/0x30 [auxiliary] [ 2633.544661] device_release_driver_internal+0x103/0x1f0 [ 2633.545679] bus_remove_device+0xf7/0x170 [ 2633.546640] device_del+0x181/0x410 [ 2633.547606] mlx5_rescan_drivers_locked.part.10+0x63/0x160 [mlx5_core] [ 2633.548777] mlx5_unregister_device+0x27/0x40 [mlx5_core] [ 2633.549841] mlx5_uninit_one+0x21/0xc0 [mlx5_core] [ 2633.550864] remove_one+0x69/0xe0 [mlx5_core] [ 2633.551819] pci_device_remove+0x3b/0xc0 [ 2633.552731] device_release_driver_internal+0x103/0x1f0 [ 2633.553746] unbind_store+0xf6/0x130 [ 2633.554657] kernfs_fop_write+0x116/0x190 [ 2633.555567] vfs_write+0xa5/0x1a0 [ 2633.556407] ksys_write+0x4f/0xb0 [ 2633.557233] do_syscall_64+0x5b/0x1a0 [ 2633.558071] entry_SYSCALL_64_after_hwframe+0x65/0xca [ 2633.559018] RIP: 0033:0x7f9977132648 [ 2633.559821] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 55 6f 2d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55 [ 2633.562332] RSP: 002b:00007fffb1a83888 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 2633.563472] RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f9977132648 [ 2633.564541] RDX: 000000000000000c RSI: 000055b90546e230 RDI: 0000000000000001 [ 2633.565596] RBP: 000055b90546e230 R08: 00007f9977406860 R09: 00007f9977a54740 [ 2633.566653] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f99774056e0 [ 2633.567692] R13: 000000000000000c R14: 00007f9977400880 R15: 000000000000000c [ 2633.568725] ---[ end trace 10b4fe52945e544d ]--- Fixes: 6a6fabbfa3e8 ("net/mlx5: Update pci error handler entries and command translation") Signed-off-by: Neta Ostrovsky <[email protected]> Reviewed-by: Leon Romanovsky <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2021-11-16net/mlx5: DR, Fix check for unsupported fields in match paramYevgeny Kliteynik1-5/+6
The existing loop doesn't cast the buffer while scanning it, which results in out-of-bounds read and failure to create the matcher. Fixes: 941f19798a11 ("net/mlx5: DR, Add check for unsupported fields in match param") Signed-off-by: Yevgeny Kliteynik <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2021-11-16net/mlx5: DR, Handle eswitch manager and uplink vports separatelyYevgeny Kliteynik2-33/+24
When querying eswitch manager vport capabilities as "other = 1", we encounter a FW compatibility issue with older FW versions. To maintain backward compatibility, eswitch manager vport should be queried as "other = 0" vport both for ECPF and non-ECPF cases. This patch fixes these queries and improves the code readability by handling eswitch manager and uplink vports separately, avoiding the excessive 'if' conditions. Also, uplink caps are stored similar to esw manager and not as part of xarray. Fixes: dd4acb2a0954 ("net/mlx5: DR, Add missing query for vport 0") Signed-off-by: Yevgeny Kliteynik <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2021-11-16net/mlx5e: nullify cq->dbg pointer in mlx5_debug_cq_remove()Valentine Fatiev2-3/+6
Prior to this patch in case mlx5_core_destroy_cq() failed it proceeds to rest of destroy operations. mlx5_core_destroy_cq() could be called again by user and cause additional call of mlx5_debug_cq_remove(). cq->dbg was not nullify in previous call and cause the crash. Fix it by nullify cq->dbg pointer after removal. Also proceed to destroy operations only if FW return 0 for MLX5_CMD_OP_DESTROY_CQ command. general protection fault, probably for non-canonical address 0x2000300004058: 0000 [#1] SMP PTI CPU: 5 PID: 1228 Comm: python Not tainted 5.15.0-rc5_for_upstream_min_debug_2021_10_14_11_06 #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:lockref_get+0x1/0x60 Code: 5d e9 53 ff ff ff 48 8d 7f 70 e8 0a 2e 48 00 c7 85 d0 00 00 00 02 00 00 00 c6 45 70 00 fb 5d c3 c3 cc cc cc cc cc cc cc cc 53 <48> 8b 17 48 89 fb 85 d2 75 3d 48 89 d0 bf 64 00 00 00 48 89 c1 48 RSP: 0018:ffff888137dd7a38 EFLAGS: 00010206 RAX: 0000000000000000 RBX: ffff888107d5f458 RCX: 00000000fffffffe RDX: 000000000002c2b0 RSI: ffffffff8155e2e0 RDI: 0002000300004058 RBP: ffff888137dd7a88 R08: 0002000300004058 R09: ffff8881144a9f88 R10: 0000000000000000 R11: 0000000000000000 R12: ffff8881141d4000 R13: ffff888137dd7c68 R14: ffff888137dd7d58 R15: ffff888137dd7cc0 FS: 00007f4644f2a4c0(0000) GS:ffff8887a2d40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055b4500f4380 CR3: 0000000114f7a003 CR4: 0000000000170ea0 Call Trace: simple_recursive_removal+0x33/0x2e0 ? debugfs_remove+0x60/0x60 debugfs_remove+0x40/0x60 mlx5_debug_cq_remove+0x32/0x70 [mlx5_core] mlx5_core_destroy_cq+0x41/0x1d0 [mlx5_core] devx_obj_cleanup+0x151/0x330 [mlx5_ib] ? __pollwait+0xd0/0xd0 ? xas_load+0x5/0x70 ? xa_load+0x62/0xa0 destroy_hw_idr_uobject+0x20/0x80 [ib_uverbs] uverbs_destroy_uobject+0x3b/0x360 [ib_uverbs] uobj_destroy+0x54/0xa0 [ib_uverbs] ib_uverbs_cmd_verbs+0xaf2/0x1160 [ib_uverbs] ? uverbs_finalize_object+0xd0/0xd0 [ib_uverbs] ib_uverbs_ioctl+0xc4/0x1b0 [ib_uverbs] __x64_sys_ioctl+0x3e4/0x8e0 Fixes: 94b960b9deff ("net/mlx5e: Fix memory leak in mlx5_core_destroy_cq() error path") Signed-off-by: Valentine Fatiev <[email protected]> Reviewed-by: Leon Romanovsky <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2021-11-16net/mlx5: E-Switch, Fix resetting of encap mode when entering switchdevPaul Blakey3-11/+9
E-Switch encap mode is relevant only when in switchdev mode. The RDMA driver can query the encap configuration via mlx5_eswitch_get_encap_mode(). Make sure it returns the currently used mode and not the set one. This reverts the cited commit which reset the encap mode on entering switchdev and fixes the original issue properly. Fixes: 9a64144d683a ("net/mlx5: E-Switch, Fix default encap mode") Signed-off-by: Paul Blakey <[email protected]> Reviewed-by: Mark Bloch <[email protected]> Reviewed-by: Maor Dickman <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2021-11-16net/mlx5e: Wait for concurrent flow deletion during neigh/fib eventsVlad Buslov3-1/+10
Function mlx5e_take_tmp_flow() skips flows with zero reference count. This can cause syndrome 0x179e84 when the called from neigh or route update code and the skipped flow is not removed from the hardware by the time underlying encap/decap resource is deleted. Add new completion 'del_hw_done' that is completed when flow is unoffloaded. This is safe to do because flow with reference count zero needs to be detached from encap/decap entry before its memory is deallocated, which requires taking the encap_tbl_lock mutex that is held by the event handlers code. Fixes: 8914add2c9e5 ("net/mlx5e: Handle FIB events to update tunnel endpoint device") Signed-off-by: Vlad Buslov <[email protected]> Reviewed-by: Roi Dayan <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2021-11-16net/mlx5e: kTLS, Fix crash in RX resync flowTariq Toukan1-6/+17
For the TLS RX resync flow, we maintain a list of TLS contexts that require some attention, to communicate their resync information to the HW. Here we fix list corruptions, by protecting the entries against movements coming from resync_handle_seq_match(), until their resync handling in napi is fully completed. Fixes: e9ce991bce5b ("net/mlx5e: kTLS, Add resiliency to RX resync failures") Signed-off-by: Tariq Toukan <[email protected]> Reviewed-by: Maxim Mikityanskiy <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2021-11-16thermal: core: Reset previous low and high trip during thermal zone initManaf Meethalavalappu Pallikunhi1-0/+2
During the suspend is in process, thermal_zone_device_update bails out thermal zone re-evaluation for any sensor trip violation without setting next valid trip to that sensor. It assumes during resume it will re-evaluate same thermal zone and update trip. But when it is in suspend temperature goes down and on resume path while updating thermal zone if temperature is less than previously violated trip, thermal zone set trip function evaluates the same previous high and previous low trip as new high and low trip. Since there is no change in high/low trip, it bails out from thermal zone set trip API without setting any trip. It leads to a case where sensor high trip or low trip is disabled forever even though thermal zone has a valid high or low trip. During thermal zone device init, reset thermal zone previous high and low trip. It resolves above mentioned scenario. Signed-off-by: Manaf Meethalavalappu Pallikunhi <[email protected]> Reviewed-by: Thara Gopinath <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2021-11-16thermal: int340x: Limit Kconfig to 64-bitArnd Bergmann1-2/+2
32-bit processors cannot generally access 64-bit MMIO registers atomically, and it is unknown in which order the two halves of this registers would need to be read: drivers/thermal/intel/int340x_thermal/processor_thermal_mbox.c: In function 'send_mbox_cmd': drivers/thermal/intel/int340x_thermal/processor_thermal_mbox.c:79:37: error: implicit declaration of function 'readq'; did you mean 'readl'? [-Werror=implicit-function-declaration] 79 | *cmd_resp = readq((void __iomem *) (proc_priv->mmio_base + MBOX_OFFSET_DATA)); | ^~~~~ | readl The driver already does not build for anything other than x86, so limit it further to x86-64. Fixes: aeb58c860dc5 ("thermal/drivers/int340x: processor_thermal: Suppot 64 bit RFIM responses") Signed-off-by: Arnd Bergmann <[email protected]> Reviewed-by: Srinivas Pandruvada <[email protected]> Reviewed-by: Randy Dunlap <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2021-11-16Merge branch 'kvm-selftest' into kvm-masterPaolo Bonzini13-237/+308
- Cleanups for the perf test infrastructure and mapping hugepages - Avoid contention on mmap_sem when the guests start to run - Add event channel upcall support to xen_shinfo_test
2021-11-16IB/hfi1: Properly allocate rdma counter desc memoryDennis Dalessandro1-3/+2
When optional counter support was added the allocation of the memory holding the counter descriptors was not cleared properly. This caused WARN_ON()s in the IB/sysfs code to be hit. This is because the uninitialized memory made some of the counters wrongly look like optional counters. Use kzalloc. While here change the sizeof() calls to use the pointer rather than the name of the type. WARNING: CPU: 0 PID: 32644 at drivers/infiniband/core/sysfs.c:1064 ib_setup_port_attrs+0x7e1/0x890 [ib_core] CPU: 0 PID: 32644 Comm: kworker/0:2 Tainted: G S W 5.15.0+ #36 Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.0018.C4.072020161249 07/20/2016 Workqueue: events work_for_cpu_fn RIP: 0010:ib_setup_port_attrs+0x7e1/0x890 [ib_core] RSP: 0018:ffffc90006ea3c40 EFLAGS: 00010202 RAX: 0000000000000068 RBX: ffff888106ad8000 RCX: 0000000000000138 RDX: ffff888126c84c00 RSI: ffff888103c41000 RDI: 0000000000000124 RBP: ffff88810f63a801 R08: ffff888126c8a000 R09: 0000000000000001 R10: ffffffffa09acf20 R11: 0000000000000065 R12: ffff88810f63a800 R13: ffff88810f63a800 R14: ffff88810f63a8e0 R15: 0000000000000001 FS: 0000000000000000(0000) GS:ffff888667a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005590102cb078 CR3: 000000000240a003 CR4: 00000000001706f0 Call Trace: ib_register_device.cold.44+0x23e/0x2d0 [ib_core] rvt_register_device+0xfa/0x230 [rdmavt] hfi1_register_ib_device+0x623/0x690 [hfi1] init_one.cold.36+0x2d1/0x49b [hfi1] local_pci_probe+0x45/0x80 work_for_cpu_fn+0x16/0x20 process_one_work+0x1b1/0x360 worker_thread+0x1d4/0x3a0 kthread+0x11a/0x140 ret_from_fork+0x22/0x30 Fixes: 5e2ddd1e5982 ("RDMA/counter: Add optional counter support") Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Mike Marciniszyn <[email protected]> Signed-off-by: Dennis Dalessandro <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2021-11-16RDMA/core: Set send and receive CQ before forwarding to the driverLeon Romanovsky1-0/+3
Preset both receive and send CQ pointers prior to call to the drivers and overwrite it later again till the mlx4 is going to be changed do not overwrite ibqp properties. This change is needed for mlx5, because in case of QP creation failure, it will go to the path of QP destroy which relies on proper CQ pointers. BUG: KASAN: use-after-free in create_qp.cold+0x164/0x16e [mlx5_ib] Write of size 8 at addr ffff8880064c55c0 by task a.out/246 CPU: 0 PID: 246 Comm: a.out Not tainted 5.15.0+ #291 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack_lvl+0x45/0x59 print_address_description.constprop.0+0x1f/0x140 kasan_report.cold+0x83/0xdf create_qp.cold+0x164/0x16e [mlx5_ib] mlx5_ib_create_qp+0x358/0x28a0 [mlx5_ib] create_qp.part.0+0x45b/0x6a0 [ib_core] ib_create_qp_user+0x97/0x150 [ib_core] ib_uverbs_handler_UVERBS_METHOD_QP_CREATE+0x92c/0x1250 [ib_uverbs] ib_uverbs_cmd_verbs+0x1c38/0x3150 [ib_uverbs] ib_uverbs_ioctl+0x169/0x260 [ib_uverbs] __x64_sys_ioctl+0x866/0x14d0 do_syscall_64+0x3d/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae Allocated by task 246: kasan_save_stack+0x1b/0x40 __kasan_kmalloc+0xa4/0xd0 create_qp.part.0+0x92/0x6a0 [ib_core] ib_create_qp_user+0x97/0x150 [ib_core] ib_uverbs_handler_UVERBS_METHOD_QP_CREATE+0x92c/0x1250 [ib_uverbs] ib_uverbs_cmd_verbs+0x1c38/0x3150 [ib_uverbs] ib_uverbs_ioctl+0x169/0x260 [ib_uverbs] __x64_sys_ioctl+0x866/0x14d0 do_syscall_64+0x3d/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae Freed by task 246: kasan_save_stack+0x1b/0x40 kasan_set_track+0x1c/0x30 kasan_set_free_info+0x20/0x30 __kasan_slab_free+0x10c/0x150 slab_free_freelist_hook+0xb4/0x1b0 kfree+0xe7/0x2a0 create_qp.part.0+0x52b/0x6a0 [ib_core] ib_create_qp_user+0x97/0x150 [ib_core] ib_uverbs_handler_UVERBS_METHOD_QP_CREATE+0x92c/0x1250 [ib_uverbs] ib_uverbs_cmd_verbs+0x1c38/0x3150 [ib_uverbs] ib_uverbs_ioctl+0x169/0x260 [ib_uverbs] __x64_sys_ioctl+0x866/0x14d0 do_syscall_64+0x3d/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae Fixes: 514aee660df4 ("RDMA: Globally allocate and release QP memory") Link: https://lore.kernel.org/r/2dbb2e2cbb1efb188a500e5634be1d71956424ce.1636631035.git.leonro@nvidia.com Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2021-11-16RDMA/netlink: Add __maybe_unused to static inline in C fileLeon Romanovsky1-1/+1
Like other commits in the tree add __maybe_unused to a static inline in a C file because some clang compilers will complain about unused code: >> drivers/infiniband/core/nldev.c:2543:1: warning: unused function '__chk_RDMA_NL_NLDEV' MODULE_ALIAS_RDMA_NETLINK(RDMA_NL_NLDEV, 5); ^ Fixes: e3bf14bdc17a ("rdma: Autoload netlink client modules") Link: https://lore.kernel.org/r/4a8101919b765e01d7fde6f27fd572c958deeb4a.1636267207.git.leonro@nvidia.com Reported-by: kernel test robot <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2021-11-16btrfs: deprecate BTRFS_IOC_BALANCE ioctlNikolay Borisov1-0/+4
The v2 balance ioctl has been introduced more than 9 years ago. Users of the old v1 ioctl should have long been migrated to it. It's time we deprecate it and eventually remove it. The only known user is in btrfs-progs that tries v1 as a fallback in case v2 is not supported. This is not necessary anymore. Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Anand Jain <[email protected]> Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-11-16btrfs: make 1-bit bit-fields of scrub_page unsigned intColin Ian King1-2/+2
The bitfields have_csum and io_error are currently signed which is not recommended as the representation is an implementation defined behaviour. Fix this by making the bit-fields unsigned ints. Fixes: 2c36395430b0 ("btrfs: scrub: remove the anonymous structure from scrub_page") Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Colin Ian King <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-11-16btrfs: check-integrity: fix a warning on write caching disabled diskWang Yugui1-1/+13
When a disk has write caching disabled, we skip submission of a bio with flush and sync requests before writing the superblock, since it's not needed. However when the integrity checker is enabled, this results in reports that there are metadata blocks referred by a superblock that were not properly flushed. So don't skip the bio submission only when the integrity checker is enabled for the sake of simplicity, since this is a debug tool and not meant for use in non-debug builds. fstests/btrfs/220 trigger a check-integrity warning like the following when CONFIG_BTRFS_FS_CHECK_INTEGRITY=y and the disk with WCE=0. btrfs: attempt to write superblock which references block M @5242880 (sdb2/5242880/0) which is not flushed out of disk's write cache (block flush_gen=1, dev->flush_gen=0)! ------------[ cut here ]------------ WARNING: CPU: 28 PID: 843680 at fs/btrfs/check-integrity.c:2196 btrfsic_process_written_superblock+0x22a/0x2a0 [btrfs] CPU: 28 PID: 843680 Comm: umount Not tainted 5.15.0-0.rc5.39.el8.x86_64 #1 Hardware name: Dell Inc. Precision T7610/0NK70N, BIOS A18 09/11/2019 RIP: 0010:btrfsic_process_written_superblock+0x22a/0x2a0 [btrfs] RSP: 0018:ffffb642afb47940 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000000 RDX: 00000000ffffffff RSI: ffff8b722fc97d00 RDI: ffff8b722fc97d00 RBP: ffff8b5601c00000 R08: 0000000000000000 R09: c0000000ffff7fff R10: 0000000000000001 R11: ffffb642afb476f8 R12: ffffffffffffffff R13: ffffb642afb47974 R14: ffff8b5499254c00 R15: 0000000000000003 FS: 00007f00a06d4080(0000) GS:ffff8b722fc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fff5cff5ff0 CR3: 00000001c0c2a006 CR4: 00000000001706e0 Call Trace: btrfsic_process_written_block+0x2f7/0x850 [btrfs] __btrfsic_submit_bio.part.19+0x310/0x330 [btrfs] ? bio_associate_blkg_from_css+0xa4/0x2c0 btrfsic_submit_bio+0x18/0x30 [btrfs] write_dev_supers+0x81/0x2a0 [btrfs] ? find_get_pages_range_tag+0x219/0x280 ? pagevec_lookup_range_tag+0x24/0x30 ? __filemap_fdatawait_range+0x6d/0xf0 ? __raw_callee_save___native_queued_spin_unlock+0x11/0x1e ? find_first_extent_bit+0x9b/0x160 [btrfs] ? __raw_callee_save___native_queued_spin_unlock+0x11/0x1e write_all_supers+0x1b3/0xa70 [btrfs] ? __raw_callee_save___native_queued_spin_unlock+0x11/0x1e btrfs_commit_transaction+0x59d/0xac0 [btrfs] close_ctree+0x11d/0x339 [btrfs] generic_shutdown_super+0x71/0x110 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0xb8/0x140 task_work_run+0x6d/0xb0 exit_to_user_mode_prepare+0x1f0/0x200 syscall_exit_to_user_mode+0x12/0x30 do_syscall_64+0x46/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f009f711dfb RSP: 002b:00007fff5cff7928 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 000055b68c6c9970 RCX: 00007f009f711dfb RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055b68c6c9b50 RBP: 0000000000000000 R08: 000055b68c6ca900 R09: 00007f009f795580 R10: 0000000000000000 R11: 0000000000000246 R12: 000055b68c6c9b50 R13: 00007f00a04bf184 R14: 0000000000000000 R15: 00000000ffffffff ---[ end trace 2c4b82abcef9eec4 ]--- S-65536(sdb2/65536/1) --> M-1064960(sdb2/1064960/1) Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Wang Yugui <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-11-16btrfs: silence lockdep when reading chunk tree during mountFilipe Manana1-5/+13
Often some test cases like btrfs/161 trigger lockdep splats that complain about possible unsafe lock scenario due to the fact that during mount, when reading the chunk tree we end up calling blkdev_get_by_path() while holding a read lock on a leaf of the chunk tree. That produces a lockdep splat like the following: [ 3653.683975] ====================================================== [ 3653.685148] WARNING: possible circular locking dependency detected [ 3653.686301] 5.15.0-rc7-btrfs-next-103 #1 Not tainted [ 3653.687239] ------------------------------------------------------ [ 3653.688400] mount/447465 is trying to acquire lock: [ 3653.689320] ffff8c6b0c76e528 (&disk->open_mutex){+.+.}-{3:3}, at: blkdev_get_by_dev.part.0+0xe7/0x320 [ 3653.691054] but task is already holding lock: [ 3653.692155] ffff8c6b0a9f39e0 (btrfs-chunk-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x24/0x110 [btrfs] [ 3653.693978] which lock already depends on the new lock. [ 3653.695510] the existing dependency chain (in reverse order) is: [ 3653.696915] -> #3 (btrfs-chunk-00){++++}-{3:3}: [ 3653.698053] down_read_nested+0x4b/0x140 [ 3653.698893] __btrfs_tree_read_lock+0x24/0x110 [btrfs] [ 3653.699988] btrfs_read_lock_root_node+0x31/0x40 [btrfs] [ 3653.701205] btrfs_search_slot+0x537/0xc00 [btrfs] [ 3653.702234] btrfs_insert_empty_items+0x32/0x70 [btrfs] [ 3653.703332] btrfs_init_new_device+0x563/0x15b0 [btrfs] [ 3653.704439] btrfs_ioctl+0x2110/0x3530 [btrfs] [ 3653.705405] __x64_sys_ioctl+0x83/0xb0 [ 3653.706215] do_syscall_64+0x3b/0xc0 [ 3653.706990] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 3653.708040] -> #2 (sb_internal#2){.+.+}-{0:0}: [ 3653.708994] lock_release+0x13d/0x4a0 [ 3653.709533] up_write+0x18/0x160 [ 3653.710017] btrfs_sync_file+0x3f3/0x5b0 [btrfs] [ 3653.710699] __loop_update_dio+0xbd/0x170 [loop] [ 3653.711360] lo_ioctl+0x3b1/0x8a0 [loop] [ 3653.711929] block_ioctl+0x48/0x50 [ 3653.712442] __x64_sys_ioctl+0x83/0xb0 [ 3653.712991] do_syscall_64+0x3b/0xc0 [ 3653.713519] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 3653.714233] -> #1 (&lo->lo_mutex){+.+.}-{3:3}: [ 3653.715026] __mutex_lock+0x92/0x900 [ 3653.715648] lo_open+0x28/0x60 [loop] [ 3653.716275] blkdev_get_whole+0x28/0x90 [ 3653.716867] blkdev_get_by_dev.part.0+0x142/0x320 [ 3653.717537] blkdev_open+0x5e/0xa0 [ 3653.718043] do_dentry_open+0x163/0x390 [ 3653.718604] path_openat+0x3f0/0xa80 [ 3653.719128] do_filp_open+0xa9/0x150 [ 3653.719652] do_sys_openat2+0x97/0x160 [ 3653.720197] __x64_sys_openat+0x54/0x90 [ 3653.720766] do_syscall_64+0x3b/0xc0 [ 3653.721285] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 3653.721986] -> #0 (&disk->open_mutex){+.+.}-{3:3}: [ 3653.722775] __lock_acquire+0x130e/0x2210 [ 3653.723348] lock_acquire+0xd7/0x310 [ 3653.723867] __mutex_lock+0x92/0x900 [ 3653.724394] blkdev_get_by_dev.part.0+0xe7/0x320 [ 3653.725041] blkdev_get_by_path+0xb8/0xd0 [ 3653.725614] btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs] [ 3653.726332] open_fs_devices+0xd7/0x2c0 [btrfs] [ 3653.726999] btrfs_read_chunk_tree+0x3ad/0x870 [btrfs] [ 3653.727739] open_ctree+0xb8e/0x17bf [btrfs] [ 3653.728384] btrfs_mount_root.cold+0x12/0xde [btrfs] [ 3653.729130] legacy_get_tree+0x30/0x50 [ 3653.729676] vfs_get_tree+0x28/0xc0 [ 3653.730192] vfs_kern_mount.part.0+0x71/0xb0 [ 3653.730800] btrfs_mount+0x11d/0x3a0 [btrfs] [ 3653.731427] legacy_get_tree+0x30/0x50 [ 3653.731970] vfs_get_tree+0x28/0xc0 [ 3653.732486] path_mount+0x2d4/0xbe0 [ 3653.732997] __x64_sys_mount+0x103/0x140 [ 3653.733560] do_syscall_64+0x3b/0xc0 [ 3653.734080] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 3653.734782] other info that might help us debug this: [ 3653.735784] Chain exists of: &disk->open_mutex --> sb_internal#2 --> btrfs-chunk-00 [ 3653.737123] Possible unsafe locking scenario: [ 3653.737865] CPU0 CPU1 [ 3653.738435] ---- ---- [ 3653.739007] lock(btrfs-chunk-00); [ 3653.739449] lock(sb_internal#2); [ 3653.740193] lock(btrfs-chunk-00); [ 3653.740955] lock(&disk->open_mutex); [ 3653.741431] *** DEADLOCK *** [ 3653.742176] 3 locks held by mount/447465: [ 3653.742739] #0: ffff8c6acf85c0e8 (&type->s_umount_key#44/1){+.+.}-{3:3}, at: alloc_super+0xd5/0x3b0 [ 3653.744114] #1: ffffffffc0b28f70 (uuid_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x59/0x870 [btrfs] [ 3653.745563] #2: ffff8c6b0a9f39e0 (btrfs-chunk-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x24/0x110 [btrfs] [ 3653.747066] stack backtrace: [ 3653.747723] CPU: 4 PID: 447465 Comm: mount Not tainted 5.15.0-rc7-btrfs-next-103 #1 [ 3653.748873] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 3653.750592] Call Trace: [ 3653.750967] dump_stack_lvl+0x57/0x72 [ 3653.751526] check_noncircular+0xf3/0x110 [ 3653.752136] ? stack_trace_save+0x4b/0x70 [ 3653.752748] __lock_acquire+0x130e/0x2210 [ 3653.753356] lock_acquire+0xd7/0x310 [ 3653.753898] ? blkdev_get_by_dev.part.0+0xe7/0x320 [ 3653.754596] ? lock_is_held_type+0xe8/0x140 [ 3653.755125] ? blkdev_get_by_dev.part.0+0xe7/0x320 [ 3653.755729] ? blkdev_get_by_dev.part.0+0xe7/0x320 [ 3653.756338] __mutex_lock+0x92/0x900 [ 3653.756794] ? blkdev_get_by_dev.part.0+0xe7/0x320 [ 3653.757400] ? do_raw_spin_unlock+0x4b/0xa0 [ 3653.757930] ? _raw_spin_unlock+0x29/0x40 [ 3653.758437] ? bd_prepare_to_claim+0x129/0x150 [ 3653.758999] ? trace_module_get+0x2b/0xd0 [ 3653.759508] ? try_module_get.part.0+0x50/0x80 [ 3653.760072] blkdev_get_by_dev.part.0+0xe7/0x320 [ 3653.760661] ? devcgroup_check_permission+0xc1/0x1f0 [ 3653.761288] blkdev_get_by_path+0xb8/0xd0 [ 3653.761797] btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs] [ 3653.762454] open_fs_devices+0xd7/0x2c0 [btrfs] [ 3653.763055] ? clone_fs_devices+0x8f/0x170 [btrfs] [ 3653.763689] btrfs_read_chunk_tree+0x3ad/0x870 [btrfs] [ 3653.764370] ? kvm_sched_clock_read+0x14/0x40 [ 3653.764922] open_ctree+0xb8e/0x17bf [btrfs] [ 3653.765493] ? super_setup_bdi_name+0x79/0xd0 [ 3653.766043] btrfs_mount_root.cold+0x12/0xde [btrfs] [ 3653.766780] ? rcu_read_lock_sched_held+0x3f/0x80 [ 3653.767488] ? kfree+0x1f2/0x3c0 [ 3653.767979] legacy_get_tree+0x30/0x50 [ 3653.768548] vfs_get_tree+0x28/0xc0 [ 3653.769076] vfs_kern_mount.part.0+0x71/0xb0 [ 3653.769718] btrfs_mount+0x11d/0x3a0 [btrfs] [ 3653.770381] ? rcu_read_lock_sched_held+0x3f/0x80 [ 3653.771086] ? kfree+0x1f2/0x3c0 [ 3653.771574] legacy_get_tree+0x30/0x50 [ 3653.772136] vfs_get_tree+0x28/0xc0 [ 3653.772673] path_mount+0x2d4/0xbe0 [ 3653.773201] __x64_sys_mount+0x103/0x140 [ 3653.773793] do_syscall_64+0x3b/0xc0 [ 3653.774333] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 3653.775094] RIP: 0033:0x7f648bc45aaa This happens because through btrfs_read_chunk_tree(), which is called only during mount, ends up acquiring the mutex open_mutex of a block device while holding a read lock on a leaf of the chunk tree while other paths need to acquire other locks before locking extent buffers of the chunk tree. Since at mount time when we call btrfs_read_chunk_tree() we know that we don't have other tasks running in parallel and modifying the chunk tree, we can simply skip locking of chunk tree extent buffers. So do that and move the assertion that checks the fs is not yet mounted to the top block of btrfs_read_chunk_tree(), with a comment before doing it. Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-11-16btrfs: fix memory ordering between normal and ordered work functionsNikolay Borisov1-0/+14
Ordered work functions aren't guaranteed to be handled by the same thread which executed the normal work functions. The only way execution between normal/ordered functions is synchronized is via the WORK_DONE_BIT, unfortunately the used bitops don't guarantee any ordering whatsoever. This manifested as seemingly inexplicable crashes on ARM64, where async_chunk::inode is seen as non-null in async_cow_submit which causes submit_compressed_extents to be called and crash occurs because async_chunk::inode suddenly became NULL. The call trace was similar to: pc : submit_compressed_extents+0x38/0x3d0 lr : async_cow_submit+0x50/0xd0 sp : ffff800015d4bc20 <registers omitted for brevity> Call trace: submit_compressed_extents+0x38/0x3d0 async_cow_submit+0x50/0xd0 run_ordered_work+0xc8/0x280 btrfs_work_helper+0x98/0x250 process_one_work+0x1f0/0x4ac worker_thread+0x188/0x504 kthread+0x110/0x114 ret_from_fork+0x10/0x18 Fix this by adding respective barrier calls which ensure that all accesses preceding setting of WORK_DONE_BIT are strictly ordered before setting the flag. At the same time add a read barrier after reading of WORK_DONE_BIT in run_ordered_work which ensures all subsequent loads would be strictly ordered after reading the bit. This in turn ensures are all accesses before WORK_DONE_BIT are going to be strictly ordered before any access that can occur in ordered_func. Reported-by: Chris Murphy <[email protected]> Fixes: 08a9ff326418 ("btrfs: Added btrfs_workqueue_struct implemented ordered execution based on kernel workqueue") CC: [email protected] # 4.4+ Link: https://bugzilla.redhat.com/show_bug.cgi?id=2011928 Reviewed-by: Josef Bacik <[email protected]> Tested-by: Chris Murphy <[email protected]> Signed-off-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-11-16btrfs: fix a out-of-bound access in copy_compressed_data_to_page()Qu Wenruo1-1/+11
[BUG] The following script can cause btrfs to crash: $ mount -o compress-force=lzo $DEV /mnt $ dd if=/dev/urandom of=/mnt/foo bs=4k count=1 $ sync The call trace looks like this: general protection fault, probably for non-canonical address 0xe04b37fccce3b000: 0000 [#1] PREEMPT SMP NOPTI CPU: 5 PID: 164 Comm: kworker/u20:3 Not tainted 5.15.0-rc7-custom+ #4 Workqueue: btrfs-delalloc btrfs_work_helper [btrfs] RIP: 0010:__memcpy+0x12/0x20 Call Trace: lzo_compress_pages+0x236/0x540 [btrfs] btrfs_compress_pages+0xaa/0xf0 [btrfs] compress_file_range+0x431/0x8e0 [btrfs] async_cow_start+0x12/0x30 [btrfs] btrfs_work_helper+0xf6/0x3e0 [btrfs] process_one_work+0x294/0x5d0 worker_thread+0x55/0x3c0 kthread+0x140/0x170 ret_from_fork+0x22/0x30 ---[ end trace 63c3c0f131e61982 ]--- [CAUSE] In lzo_compress_pages(), parameter @out_pages is not only an output parameter (for the number of compressed pages), but also an input parameter, as the upper limit of compressed pages we can utilize. In commit d4088803f511 ("btrfs: subpage: make lzo_compress_pages() compatible"), the refactoring doesn't take @out_pages as an input, thus completely ignoring the limit. And for compress-force case, we could hit incompressible data that compressed size would go beyond the page limit, and cause the above crash. [FIX] Save @out_pages as @max_nr_page, and pass it to lzo_compress_pages(), and check if we're beyond the limit before accessing the pages. Note: this also fixes crash on 32bit architectures that was suspected to be caused by merge of btrfs patches to 5.16-rc1. Reported in https://lore.kernel.org/all/[email protected]/ . Reported-by: Omar Sandoval <[email protected]> Fixes: d4088803f511 ("btrfs: subpage: make lzo_compress_pages() compatible") Reviewed-by: Omar Sandoval <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> [ add note ] Signed-off-by: David Sterba <[email protected]>
2021-11-16Merge branch '40GbE' of ↵David S. Miller3-22/+64
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2021-11-15 This series contains updates to iavf driver only. Mateusz adds a wait for reset completion when changing queue count which could otherwise cause issues with VF reset. Nick adds a null check for vf_res in iavf_fix_features(), corrects ordering of function calls to resolve dependency issues, and prevents possible freeing of a lock which isn't being held. Piotr fixes logic that did not allow setting all multicast mode without promiscuous mode. Jake prevents possible accidental freeing of filter structure. Mitch adds null checks for key and indir parameters in iavf_get_rxfh(). Surabhi adds an additional check that would, previously, cause the driver to print a false error due to values obtained while the VF is in reset. Grzegorz prevents a queue request of 0 which would cause queue count to reset to default values. Akeem restores VLAN filters when bringing the interface back up. ==================== Signed-off-by: David S. Miller <[email protected]>
2021-11-16KVM: x86: Fix uninitialized eoi_exit_bitmap usage in vcpu_load_eoi_exitmap()黄乐1-2/+6
In vcpu_load_eoi_exitmap(), currently the eoi_exit_bitmap[4] array is initialized only when Hyper-V context is available, in other path it is just passed to kvm_x86_ops.load_eoi_exitmap() directly from on the stack, which would cause unexpected interrupt delivery/handling issues, e.g. an *old* linux kernel that relies on PIT to do clock calibration on KVM might randomly fail to boot. Fix it by passing ioapic_handled_vectors to load_eoi_exitmap() when Hyper-V context is not available. Fixes: f2bc14b69c38 ("KVM: x86: hyper-v: Prepare to meet unallocated Hyper-V context") Cc: [email protected] Reviewed-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Huang Le <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-11-16KVM: selftests: Use perf_test_destroy_vm in memslot_modification_stress_testDavid Matlack1-2/+1
Change memslot_modification_stress_test to use perf_test_destroy_vm instead of manually calling ucall_uninit and kvm_vm_free. No functional change intended. Signed-off-by: David Matlack <[email protected]> Reviewed-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-11-16KVM: selftests: Wait for all vCPU to be created before entering guest modeDavid Matlack1-0/+26
Thread creation requires taking the mmap_sem in write mode, which causes vCPU threads running in guest mode to block while they are populating memory. Fix this by waiting for all vCPU threads to be created and start running before entering guest mode on any one vCPU thread. This substantially improves the "Populate memory time" when using 1GiB pages since it allows all vCPUs to zero pages in parallel rather than blocking because a writer is waiting (which is waiting for another vCPU that is busy zeroing a 1GiB page). Before: $ ./dirty_log_perf_test -v256 -s anonymous_hugetlb_1gb ... Populate memory time: 52.811184013s After: $ ./dirty_log_perf_test -v256 -s anonymous_hugetlb_1gb ... Populate memory time: 10.204573342s Signed-off-by: David Matlack <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-11-16KVM: selftests: Move vCPU thread creation and joining to common helpersDavid Matlack6-90/+67
Move vCPU thread creation and joining to common helper functions. This is in preparation for the next commit which ensures that all vCPU threads are fully created before entering guest mode on any one vCPU. No functional change intended. Signed-off-by: David Matlack <[email protected]> Reviewed-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-11-16KVM: selftests: Start at iteration 0 instead of -1David Matlack1-5/+3
Start at iteration 0 instead of -1 to avoid having to initialize vcpu_last_completed_iteration when setting up vCPU threads. This simplifies the next commit where we move vCPU thread initialization out to a common helper. No functional change intended. Signed-off-by: David Matlack <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-11-16KVM: selftests: Sync perf_test_args to guest during VM creationSean Christopherson6-15/+16
Copy perf_test_args to the guest during VM creation instead of relying on the caller to do so at their leisure. Ideally, tests wouldn't even be able to modify perf_test_args, i.e. they would have no motivation to do the sync, but enforcing that is arguably a net negative for readability. No functional change intended. [Set wr_fract=1 by default and add helper to override it since the new access_tracking_perf_test needs to set it dynamically.] Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: David Matlack <[email protected]> Reviewed-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-11-16KVM: selftests: Fill per-vCPU struct during "perf_test" VM creationSean Christopherson6-54/+45
Fill the per-vCPU args when creating the perf_test VM instead of having the caller do so. This helps ensure that any adjustments to the number of pages (and thus vcpu_memory_bytes) are reflected in the per-VM args. Automatically filling the per-vCPU args will also allow a future patch to do the sync to the guest during creation. Signed-off-by: Sean Christopherson <[email protected]> [Updated access_tracking_perf_test as well.] Signed-off-by: David Matlack <[email protected]> Reviewed-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-11-16KVM: selftests: Create VM with adjusted number of guest pages for perf testsSean Christopherson1-2/+6
Use the already computed guest_num_pages when creating the so called extra VM pages for a perf test, and add a comment explaining why the pages are allocated as extra pages. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Ben Gardon <[email protected]> Signed-off-by: David Matlack <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-11-16KVM: selftests: Remove perf_test_args.host_page_sizeSean Christopherson2-3/+1
Remove perf_test_args.host_page_size and instead use getpagesize() so that it's somewhat obvious that, for tests that care about the host page size, they care about the system page size, not the hardware page size, e.g. that the logic is unchanged if hugepages are in play. No functional change intended. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Ben Gardon <[email protected]> Signed-off-by: David Matlack <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-11-16KVM: selftests: Move per-VM GPA into perf_test_argsSean Christopherson3-21/+10
Move the per-VM GPA into perf_test_args instead of storing it as a separate global variable. It's not obvious that guest_test_phys_mem holds a GPA, nor that it's connected/coupled with per_vcpu->gpa. No functional change intended. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Ben Gardon <[email protected]> Signed-off-by: David Matlack <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>