aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2020-11-18powerpc/64s/exception: KVM Fix for host DSI being taken in HPT guest MMU contextNicholas Piggin1-4/+7
Commit 2284ffea8f0c ("powerpc/64s/exception: Only test KVM in SRR interrupts when PR KVM is supported") removed KVM guest tests from interrupts that do not set HV=1, when PR-KVM is not configured. This is wrong for HV-KVM HPT guest MMIO emulation case which attempts to load the faulting instruction word with MSR[DR]=1 and MSR[HV]=1 with the guest MMU context loaded. This can cause host DSI, DSLB interrupts which must test for KVM guest. Restore this and add a comment. Fixes: 2284ffea8f0c ("powerpc/64s/exception: Only test KVM in SRR interrupts when PR KVM is supported") Cc: [email protected] # v5.7+ Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-11-17ipv4: use IS_ENABLED instead of ifdefFlorian Klink1-1/+1
Checking for ifdef CONFIG_x fails if CONFIG_x=m. Use IS_ENABLED instead, which is true for both built-ins and modules. Otherwise, a > ip -4 route add 1.2.3.4/32 via inet6 fe80::2 dev eth1 fails with the message "Error: IPv6 support not enabled in kernel." if CONFIG_IPV6 is `m`. In the spirit of b8127113d01e53adba15b41aefd37b90ed83d631. Fixes: d15662682db2 ("ipv4: Allow ipv6 gateway with ipv4 routes") Cc: Kim Phillips <[email protected]> Signed-off-by: Florian Klink <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2020-11-17qed: fix ILT configuration of SRC blockDmitry Bogdanov2-5/+2
The code refactoring of ILT configuration was not complete, the old unused variables were used for the SRC block. That could lead to the memory corruption by HW when rx filters are configured. This patch completes that refactoring. Fixes: 8a52bbab39c9 (qed: Debug feature: ilt and mdump) Signed-off-by: Igor Russkikh <[email protected]> Signed-off-by: Ariel Elior <[email protected]> Signed-off-by: Dmitry Bogdanov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2020-11-17inet_diag: Fix error path to cancel the meseage in inet_req_diag_fill()Wang Hai1-1/+3
nlmsg_cancel() needs to be called in the error path of inet_req_diag_fill to cancel the message. Fixes: d545caca827b ("net: inet: diag: expose the socket mark to privileged processes.") Reported-by: Hulk Robot <[email protected]> Signed-off-by: Wang Hai <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2020-11-17tools/testing/scatterlist: Fix test to compile and runMaor Gottlieb2-2/+3
Add missing define of ALIGN_DOWN to make the test build and run. In addition, __sg_alloc_table_from_pages now support unaligned maximum segment, so adapt the test result accordingly. Fixes: 07da1223ec93 ("lib/scatterlist: Add support in dynamic allocation of SG table from pages") Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Maor Gottlieb <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2020-11-18bpf, sockmap: Avoid failures from skb_to_sgvec when skb has frag_listJohn Fastabend1-2/+9
When skb has a frag_list its possible for skb_to_sgvec() to fail. This happens when the scatterlist has fewer elements to store pages than would be needed for the initial skb plus any of its frags. This case appears rare, but is possible when running an RX parser/verdict programs exposed to the internet. Currently, when this happens we throw an error, break the pipe, and kfree the msg. This effectively breaks the application or forces it to do a retry. Lets catch this case and handle it by doing an skb_linearize() on any skb we receive with frags. At this point skb_to_sgvec should not fail because the failing conditions would require frags to be in place. Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: John Fastabend <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Reviewed-by: Jakub Sitnicki <[email protected]> Link: https://lore.kernel.org/bpf/160556576837.73229.14800682790808797635.stgit@john-XPS-13-9370
2020-11-18bpf, sockmap: Handle memory acct if skb_verdict prog redirects to selfJohn Fastabend1-0/+8
If the skb_verdict_prog redirects an skb knowingly to itself, fix your BPF program this is not optimal and an abuse of the API please use SK_PASS. That said there may be cases, such as socket load balancing, where picking the socket is hashed based or otherwise picks the same socket it was received on in some rare cases. If this happens we don't want to confuse userspace giving them an EAGAIN error if we can avoid it. To avoid double accounting in these cases. At the moment even if the skb has already been charged against the sockets rcvbuf and forward alloc we check it again and do set_owner_r() causing it to be orphaned and recharged. For one this is useless work, but more importantly we can have a case where the skb could be put on the ingress queue, but because we are under memory pressure we return EAGAIN. The trouble here is the skb has already been accounted for so any rcvbuf checks include the memory associated with the packet already. This rolls up and can result in unnecessary EAGAIN errors in userspace read() calls. Fix by doing an unlikely check and skipping checks if skb->sk == sk. Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path") Signed-off-by: John Fastabend <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Reviewed-by: Jakub Sitnicki <[email protected]> Link: https://lore.kernel.org/bpf/160556574804.73229.11328201020039674147.stgit@john-XPS-13-9370
2020-11-18bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to selfJohn Fastabend1-19/+53
If a socket redirects to itself and it is under memory pressure it is possible to get a socket stuck so that recv() returns EAGAIN and the socket can not advance for some time. This happens because when redirecting a skb to the same socket we received the skb on we first check if it is OK to enqueue the skb on the receiving socket by checking memory limits. But, if the skb is itself the object holding the memory needed to enqueue the skb we will keep retrying from kernel side and always fail with EAGAIN. Then userspace will get a recv() EAGAIN error if there are no skbs in the psock ingress queue. This will continue until either some skbs get kfree'd causing the memory pressure to reduce far enough that we can enqueue the pending packet or the socket is destroyed. In some cases its possible to get a socket stuck for a noticeable amount of time if the socket is only receiving skbs from sk_skb verdict programs. To reproduce I make the socket memory limits ridiculously low so sockets are always under memory pressure. More often though if under memory pressure it looks like a spurious EAGAIN error on user space side causing userspace to retry and typically enough has moved on the memory side that it works. To fix skip memory checks and skb_orphan if receiving on the same sock as already assigned. For SK_PASS cases this is easy, its always the same socket so we can just omit the orphan/set_owner pair. For backlog cases we need to check skb->sk and decide if the orphan and set_owner pair are needed. Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path") Signed-off-by: John Fastabend <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Reviewed-by: Jakub Sitnicki <[email protected]> Link: https://lore.kernel.org/bpf/160556572660.73229.12566203819812939627.stgit@john-XPS-13-9370
2020-11-18bpf, sockmap: Use truesize with sk_rmem_schedule()John Fastabend1-1/+1
We use skb->size with sk_rmem_scheduled() which is not correct. Instead use truesize to align with socket and tcp stack usage of sk_rmem_schedule. Suggested-by: Daniel Borkman <[email protected]> Signed-off-by: John Fastabend <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Reviewed-by: Jakub Sitnicki <[email protected]> Link: https://lore.kernel.org/bpf/160556570616.73229.17003722112077507863.stgit@john-XPS-13-9370
2020-11-18bpf, sockmap: Ensure SO_RCVBUF memory is observed on ingress redirectJohn Fastabend2-5/+18
Fix sockmap sk_skb programs so that they observe sk_rcvbuf limits. This allows users to tune SO_RCVBUF and sockmap will honor them. We can refactor the if(charge) case out in later patches. But, keep this fix to the point. Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path") Suggested-by: Jakub Sitnicki <[email protected]> Signed-off-by: John Fastabend <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Reviewed-by: Jakub Sitnicki <[email protected]> Link: https://lore.kernel.org/bpf/160556568657.73229.8404601585878439060.stgit@john-XPS-13-9370
2020-11-18bpf, sockmap: Fix partial copy_page_to_iter so progress can still be madeJohn Fastabend1-6/+9
If copy_page_to_iter() fails or even partially completes, but with fewer bytes copied than expected we currently reset sg.start and return EFAULT. This proves problematic if we already copied data into the user buffer before we return an error. Because we leave the copied data in the user buffer and fail to unwind the scatterlist so kernel side believes data has been copied and user side believes data has _not_ been received. Expected behavior should be to return number of bytes copied and then on the next read we need to return the error assuming its still there. This can happen if we have a copy length spanning multiple scatterlist elements and one or more complete before the error is hit. The error is rare enough though that my normal testing with server side programs, such as nginx, httpd, envoy, etc., I have never seen this. The only reliable way to reproduce that I've found is to stream movies over my browser for a day or so and wait for it to hang. Not very scientific, but with a few extra WARN_ON()s in the code the bug was obvious. When we review the errors from copy_page_to_iter() it seems we are hitting a page fault from copy_page_to_iter_iovec() where the code checks fault_in_pages_writeable(buf, copy) where buf is the user buffer. It also seems typical server applications don't hit this case. The other way to try and reproduce this is run the sockmap selftest tool test_sockmap with data verification enabled, but it doesn't reproduce the fault. Perhaps we can trigger this case artificially somehow from the test tools. I haven't sorted out a way to do that yet though. Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: John Fastabend <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Reviewed-by: Jakub Sitnicki <[email protected]> Link: https://lore.kernel.org/bpf/160556566659.73229.15694973114605301063.stgit@john-XPS-13-9370
2020-11-17net/tls: Fix wrong record sn in async mode of device resyncTariq Toukan2-11/+42
In async_resync mode, we log the TCP seq of records until the async request is completed. Later, in case one of the logged seqs matches the resync request, we return it, together with its record serial number. Before this fix, we mistakenly returned the serial number of the current record instead. Fixes: ed9b7646b06a ("net/tls: Add asynchronous resync") Signed-off-by: Tariq Toukan <[email protected]> Reviewed-by: Boris Pismenny <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2020-11-18interconnect: qcom: msm8974: Don't boost the NoC rate during bootGeorgi Djakov1-0/+9
It has been reported that on Fairphone 2 (msm8974-based), increasing the clock rate for some of the NoCs during boot may lead to hangs. Let's restore the original behavior and not touch the clock rate of any of the NoCs to fix the regression. Reported-by: Luca Weiss <[email protected]> Tested-by: Luca Weiss <[email protected]> Fixes: b1d681d8d324 ("interconnect: Add sync state support") Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Georgi Djakov <[email protected]>
2020-11-18interconnect: qcom: msm8974: Prevent integer overflow in rateGeorgi Djakov1-0/+3
When sync_state support got introduced recently, by default we try to set the NoCs to run initially at maximum rate. But as these values are aggregated, we may end with a really big clock rate value, which is then converted from "u64" to "long" during the clock rate rounding. But on 32bit platforms this may result an overflow. Fix it by making sure that the rate is within range. Reported-by: Luca Weiss <[email protected]> Reviewed-by: Brian Masney <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Georgi Djakov <[email protected]>
2020-11-17io_uring: don't double complete failed reissue requestJens Axboe1-1/+0
Zorro reports that an xfstest test case is failing, and it turns out that for the reissue path we can potentially issue a double completion on the request for the failure path. There's an issue around the retry as well, but for now, at least just make sure that we handle the error path correctly. Cc: [email protected] Fixes: b63534c41e20 ("io_uring: re-issue block requests that failed because of resources") Reported-by: Zorro Lang <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-11-17netdevsim: set .owner to THIS_MODULETaehee Yoo3-0/+4
If THIS_MODULE is not set, the module would be removed while debugfs is being used. It eventually makes kernel panic. Fixes: 82c93a87bf8b ("netdevsim: implement couple of testing devlink health reporters") Fixes: 424be63ad831 ("netdevsim: add UDP tunnel port offload support") Fixes: 4418f862d675 ("netdevsim: implement support for devlink region and snapshots") Fixes: d3cbb907ae57 ("netdevsim: add ACL trap reporting cookie as a metadata") Signed-off-by: Taehee Yoo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2020-11-17seccomp: Set PF_SUPERPRIV when checking capabilityMickaël Salaün1-3/+2
Replace the use of security_capable(current_cred(), ...) with ns_capable_noaudit() which set PF_SUPERPRIV. Since commit 98f368e9e263 ("kernel: Add noaudit variant of ns_capable()"), a new ns_capable_noaudit() helper is available. Let's use it! Cc: Jann Horn <[email protected]> Cc: Kees Cook <[email protected]> Cc: Tyler Hicks <[email protected]> Cc: Will Drewry <[email protected]> Cc: [email protected] Fixes: e2cfabdfd075 ("seccomp: add system call filtering using BPF") Signed-off-by: Mickaël Salaün <[email protected]> Reviewed-by: Jann Horn <[email protected]> Signed-off-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-11-17ptrace: Set PF_SUPERPRIV when checking capabilityMickaël Salaün1-11/+5
Commit 69f594a38967 ("ptrace: do not audit capability check when outputing /proc/pid/stat") replaced the use of ns_capable() with has_ns_capability{,_noaudit}() which doesn't set PF_SUPERPRIV. Commit 6b3ad6649a4c ("ptrace: reintroduce usage of subjective credentials in ptrace_has_cap()") replaced has_ns_capability{,_noaudit}() with security_capable(), which doesn't set PF_SUPERPRIV neither. Since commit 98f368e9e263 ("kernel: Add noaudit variant of ns_capable()"), a new ns_capable_noaudit() helper is available. Let's use it! As a result, the signature of ptrace_has_cap() is restored to its original one. Cc: Christian Brauner <[email protected]> Cc: Eric Paris <[email protected]> Cc: Jann Horn <[email protected]> Cc: Kees Cook <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Serge E. Hallyn <[email protected]> Cc: Tyler Hicks <[email protected]> Cc: [email protected] Fixes: 6b3ad6649a4c ("ptrace: reintroduce usage of subjective credentials in ptrace_has_cap()") Fixes: 69f594a38967 ("ptrace: do not audit capability check when outputing /proc/pid/stat") Signed-off-by: Mickaël Salaün <[email protected]> Reviewed-by: Jann Horn <[email protected]> Signed-off-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-11-17enetc: Workaround for MDIO register access issueAlex Marginean4-25/+161
Due to a hardware issue, an access to MDIO registers that is concurrent with other ENETC register accesses may lead to the MDIO access being dropped or corrupted. The workaround introduces locking for all register accesses to the ENETC register space. To reduce performance impact, a readers-writers locking scheme has been implemented. The writer in this case is the MDIO access code (irrelevant whether that MDIO access is a register read or write), and the reader is any access code to non-MDIO ENETC registers. Also, the datapath functions acquire the read lock fewer times and use _hot accessors. All the rest of the code uses the _wa accessors which lock every register access. The commit introducing MDIO support is - commit ebfcb23d62ab ("enetc: Add ENETC PF level external MDIO support") but due to subsequent refactoring this patch is applicable on top of a later commit. Fixes: 6517798dd343 ("enetc: Make MDIO accessors more generic and export to include/linux/fsl") Signed-off-by: Alex Marginean <[email protected]> Signed-off-by: Vladimir Oltean <[email protected]> Signed-off-by: Claudiu Manoil <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2020-11-17MAINTAINERS: Remove myself as LPC32xx maintainersSylvain Lemieux1-1/+0
I appreciate my time as a kernel maintainer for the last few years. I am no longer working on the LPC32xx platform and cannot commit time for support and discussions. Signed-off-by: Sylvain Lemieux <[email protected]> Acked-by: Vladimir Zapolskiy <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]>
2020-11-17Merge branch 'for-linus' of ↵Linus Torvalds8-18/+55
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input Pull input fixes from Dmitry Torokhov: "A fix for use-after-free in the Sun keyboard driver, a fix to firmware updates on newer ICs in the Elan touchpad diver, and a couple misc driver fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: Input: elan_i2c - fix firmware update on newer ICs Input: resistive-adc-touch - fix kconfig dependency on IIO_BUFFER Input: sunkbd - avoid use-after-free in teardown paths Input: i8042 - allow insmod to succeed on devices without an i8042 controller Input: adxl34x - clean up a data type in adxl34x_probe()
2020-11-17net/mlx5: fix error return code in mlx5e_tc_nic_init()Wang Hai1-1/+3
Fix to return a negative error code from the error handling case instead of 0, as done elsewhere in this function. Fixes: aedd133d17bc ("net/mlx5e: Support CT offload for tc nic flows") Reported-by: Hulk Robot <[email protected]> Signed-off-by: Wang Hai <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2020-11-17net/mlx5: E-Switch, Fail mlx5_esw_modify_vport_rate if qos disabledEli Cohen1-0/+4
Avoid calling mlx5_esw_modify_vport_rate() if qos is not enabled and avoid unnecessary syndrome messages from firmware. Fixes: fcb64c0f5640 ("net/mlx5: E-Switch, add ingress rate support") Signed-off-by: Eli Cohen <[email protected]> Reviewed-by: Roi Dayan <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2020-11-17net/mlx5: Disable QoS when min_rates on all VFs are zeroVladyslav Tarasiuk1-7/+8
Currently when QoS is enabled for VF and any min_rate is configured, the driver sets bw_share value to at least 1 and doesn’t allow to set it to 0 to make minimal rate unlimited. It means there is always a minimal rate configured for every VF, even if user tries to remove it. In order to make QoS disable possible, check whether all vports have configured min_rate = 0. If this is true, set their bw_share to 0 to disable min_rate limitations. Fixes: c9497c98901c ("net/mlx5: Add support for setting VF min rate") Signed-off-by: Vladyslav Tarasiuk <[email protected]> Reviewed-by: Moshe Shemesh <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2020-11-17net/mlx5: Clear bw_share upon VF disableVladyslav Tarasiuk1-0/+1
Currently, if user disables VFs with some min and max rates configured, they are cleared. But QoS data is not cleared and restored upon next VF enable placing limits on minimal rate for given VF, when user expects none. To match cleared vport->info struct with QoS-related min and max rates upon VF disable, clear vport->qos struct too. Fixes: 556b9d16d3f5 ("net/mlx5: Clear VF's configuration on disabling SRIOV") Signed-off-by: Vladyslav Tarasiuk <[email protected]> Reviewed-by: Moshe Shemesh <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2020-11-17net/mlx5: Add handling of port type in rule deletionMichael Guralnik1-0/+7
Handle destruction of rules with port destination type to enable full destruction of flow. Without this handling of TX rules the deletion of these rules fails. Dmesg of flow destruction failure: [ 203.714146] mlx5_core 0000:00:0b.0: mlx5_cmd_check:753:(pid 342): SET_FLOW_TABLE_ENTRY(0x936) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x144b7a) [ 210.547387] ------------[ cut here ]------------ [ 210.548663] refcount_t: decrement hit 0; leaking memory. [ 210.550651] WARNING: CPU: 4 PID: 342 at lib/refcount.c:31 refcount_warn_saturate+0x5c/0x110 [ 210.550654] Modules linked in: mlx5_ib mlx5_core ib_ipoib rdma_ucm rdma_cm iw_cm ib_cm ib_umad ib_uverbs ib_core [ 210.550675] CPU: 4 PID: 342 Comm: test Not tainted 5.8.0-rc2+ #116 [ 210.550678] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 [ 210.550680] RIP: 0010:refcount_warn_saturate+0x5c/0x110 [ 210.550685] Code: c6 d1 1b 01 00 0f 84 ad 00 00 00 5b 5d c3 80 3d b5 d1 1b 01 00 75 f4 48 c7 c7 20 d1 15 82 c6 05 a5 d1 1b 01 01 e8 a7 eb af ff <0f> 0b eb dd 80 3d 99 d1 1b 01 00 75 d4 48 c7 c7 c0 cf 15 82 c6 05 [ 210.550687] RSP: 0018:ffff8881642e77e8 EFLAGS: 00010282 [ 210.550691] RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000000 [ 210.550694] RDX: 0000000000000027 RSI: 0000000000000004 RDI: ffffed102c85ceef [ 210.550696] RBP: ffff888161720428 R08: ffffffff8124c10e R09: ffffed103243beae [ 210.550698] R10: ffff8881921df56b R11: ffffed103243bead R12: ffff8881841b4180 [ 210.550701] R13: ffff888161720428 R14: ffff8881616d0000 R15: ffff888161720380 [ 210.550704] FS: 00007fc27f025740(0000) GS:ffff888192000000(0000) knlGS:0000000000000000 [ 210.550706] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 210.550708] CR2: 0000557e4b41a6a0 CR3: 0000000002415004 CR4: 0000000000360ea0 [ 210.550711] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 210.550713] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 210.550715] Call Trace: [ 210.550717] mlx5_del_flow_rules+0x484/0x490 [mlx5_core] [ 210.550720] ? mlx5_cmd_set_fte+0xa80/0xa80 [mlx5_core] [ 210.550722] mlx5_ib_destroy_flow+0x17f/0x280 [mlx5_ib] [ 210.550724] uverbs_free_flow+0x4c/0x90 [ib_uverbs] [ 210.550726] destroy_hw_idr_uobject+0x41/0xb0 [ib_uverbs] [ 210.550728] uverbs_destroy_uobject+0xaa/0x390 [ib_uverbs] [ 210.550731] __uverbs_cleanup_ufile+0x129/0x1b0 [ib_uverbs] [ 210.550733] ? uverbs_destroy_uobject+0x390/0x390 [ib_uverbs] [ 210.550735] uverbs_destroy_ufile_hw+0x78/0x190 [ib_uverbs] [ 210.550737] ib_uverbs_close+0x36/0x140 [ib_uverbs] [ 210.550739] __fput+0x181/0x380 [ 210.550741] task_work_run+0x88/0xd0 [ 210.550743] do_exit+0x5f6/0x13b0 [ 210.550745] ? sched_clock_cpu+0x30/0x140 [ 210.550747] ? is_current_pgrp_orphaned+0x70/0x70 [ 210.550750] ? lock_downgrade+0x360/0x360 [ 210.550752] ? mark_held_locks+0x1d/0x90 [ 210.550754] do_group_exit+0x8a/0x140 [ 210.550756] get_signal+0x20a/0xf50 [ 210.550758] do_signal+0x8c/0xbe0 [ 210.550760] ? hrtimer_nanosleep+0x1d8/0x200 [ 210.550762] ? nanosleep_copyout+0x50/0x50 [ 210.550764] ? restore_sigcontext+0x320/0x320 [ 210.550766] ? __hrtimer_init+0xf0/0xf0 [ 210.550768] ? timespec64_add_safe+0x150/0x150 [ 210.550770] ? mark_held_locks+0x1d/0x90 [ 210.550772] ? lockdep_hardirqs_on_prepare+0x14c/0x240 [ 210.550774] __prepare_exit_to_usermode+0x119/0x170 [ 210.550776] do_syscall_64+0x65/0x300 [ 210.550778] ? trace_hardirqs_off+0x10/0x120 [ 210.550781] ? mark_held_locks+0x1d/0x90 [ 210.550783] ? asm_sysvec_apic_timer_interrupt+0xa/0x20 [ 210.550785] ? lockdep_hardirqs_on+0x112/0x190 [ 210.550787] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 210.550789] RIP: 0033:0x7fc27f1cd157 [ 210.550791] Code: Bad RIP value. [ 210.550793] RSP: 002b:00007ffd4db27ea8 EFLAGS: 00000246 ORIG_RAX: 0000000000000023 [ 210.550798] RAX: fffffffffffffdfc RBX: ffffffffffffff80 RCX: 00007fc27f1cd157 [ 210.550800] RDX: 00007fc27f025740 RSI: 00007ffd4db27eb0 RDI: 00007ffd4db27eb0 [ 210.550803] RBP: 0000000000000016 R08: 0000000000000000 R09: 000000000000000e [ 210.550805] R10: 00007ffd4db27dc7 R11: 0000000000000246 R12: 0000000000400c00 [ 210.550808] R13: 00007ffd4db285f0 R14: 0000000000000000 R15: 0000000000000000 [ 210.550809] irq event stamp: 49399 [ 210.550812] hardirqs last enabled at (49399): [<ffffffff81172d36>] console_unlock+0x556/0x6f0 [ 210.550815] hardirqs last disabled at (49398): [<ffffffff81172897>] console_unlock+0xb7/0x6f0 [ 210.550818] softirqs last enabled at (48706): [<ffffffff81e0037b>] __do_softirq+0x37b/0x60c [ 210.550820] softirqs last disabled at (48697): [<ffffffff81c00e2f>] asm_call_on_stack+0xf/0x20 [ 210.550822] ---[ end trace ad18c0e6fa846454 ]--- [ 210.581862] mlx5_core 0000:00:0c.0: mlx5_destroy_flow_table:2132:(pid 342): Flow table 262150 wasn't destroyed, refcount > 1 Fixes: a7ee18bdee83 ("RDMA/mlx5: Allow creating a matcher for a NIC TX flow table") Signed-off-by: Michael Guralnik <[email protected]> Reviewed-by: Mark Bloch <[email protected]> Reviewed-by: Maor Gottlieb <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2020-11-17net/mlx5e: Fix check if netdev is bond slaveMaor Dickman1-1/+1
Bond events handler uses bond_slave_get_rtnl to check if net device is bond slave. bond_slave_get_rtnl return the rcu rx_handler pointer from the netdev which exists for bond slaves but also exists for devices that are attached to linux bridge so using it as indication for bond slave is wrong. Fix by using netif_is_lag_port instead. Fixes: 7e51891a237f ("net/mlx5e: Use netdev events to set/del egress acl forward-to-vport rule") Signed-off-by: Maor Dickman <[email protected]> Reviewed-by: Raed Salem <[email protected]> Reviewed-by: Ariel Levkovich <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2020-11-17net/mlx5e: Fix IPsec packet drop by mlx5e_tc_update_skbHuy Nguyen4-13/+16
Both TC and IPsec crypto offload use metadata_regB to store private information. Since TC does not use bit 31 of regB, IPsec will use bit 31 as the IPsec packet marker. The IPsec's regB usage is changed to: Bit31: IPsec marker Bit30-24: IPsec syndrome Bit23-0: IPsec obj id Fixes: b2ac7541e377 ("net/mlx5e: IPsec: Add Connect-X IPsec Rx data path offload") Signed-off-by: Huy Nguyen <[email protected]> Reviewed-by: Raed Salem <[email protected]> Reviewed-by: Ariel Levkovich <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2020-11-17net/mlx5e: Set IPsec WAs only in IP's non checksum partial case.Huy Nguyen1-7/+6
The IP's checksum partial still requires L4 csum flag on Ethernet WQE. Make the IPsec WAs only for the IP's non checksum partial case (for example icmd packet) Fixes: 5be019040cb7 ("net/mlx5e: IPsec: Add Connect-X IPsec Tx data path offload") Signed-off-by: Huy Nguyen <[email protected]> Reviewed-by: Raed Salem <[email protected]> Reviewed-by: Alaa Hleihel <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2020-11-17net/mlx5e: Fix refcount leak on kTLS RX resyncMaxim Mikityanskiy1-5/+8
On resync, the driver calls inet_lookup_established (__inet6_lookup_established) that increases sk_refcnt of the socket. To decrease it, the driver set skb->destructor to sock_edemux. However, it didn't work well, because the TCP stack also sets this destructor for early demux, and the refcount gets decreased only once, while increased two times (in mlx5e and in the TCP stack). It leads to a socket leak, a TLS context leak, which in the end leads to calling tls_dev_del twice: on socket close and on driver unload, which in turn leads to a crash. This commit fixes the refcount leak by calling sock_gen_put right away after using the socket, thus fixing all the subsequent issues. Fixes: 0419d8c9d8f8 ("net/mlx5e: kTLS, Add kTLS RX resync support") Signed-off-by: Maxim Mikityanskiy <[email protected]> Reviewed-by: Tariq Toukan <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2020-11-17Merge tag 's390-5.10-4' of ↵Linus Torvalds3-1/+4
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Heiko Carstens: - fix system call exit path; avoid return to user space with any TIF/CIF/PIF set - fix file permission for cpum_sfb_size parameter - another small defconfig update * tag 's390-5.10-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/cpum_sf.c: fix file permission for cpum_sfb_size s390: update defconfigs s390: fix system call exit path
2020-11-17Merge tag 'mips_fixes_5.10_1' of ↵Linus Torvalds3-4/+12
git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux Pull MIPS fixes from Thomas Bogendoerfer: - fix bug preventing booting on several platforms - fix for build error, when modules need has_transparent_hugepage - fix for memleak in alchemy clk setup * tag 'mips_fixes_5.10_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: MIPS: Alchemy: Fix memleak in alchemy_clk_setup_cpu MIPS: kernel: Fix for_each_memblock conversion MIPS: export has_transparent_hugepage() for modules
2020-11-17tcp: only postpone PROBE_RTT if RTT is < current min_rtt estimateRyan Sharpelletti1-1/+1
During loss recovery, retransmitted packets are forced to use TCP timestamps to calculate the RTT samples, which have a millisecond granularity. BBR is designed using a microsecond granularity. As a result, multiple RTT samples could be truncated to the same RTT value during loss recovery. This is problematic, as BBR will not enter PROBE_RTT if the RTT sample is <= the current min_rtt sample, meaning that if there are persistent losses, PROBE_RTT will constantly be pushed off and potentially never re-entered. This patch makes sure that BBR enters PROBE_RTT by checking if RTT sample is < the current min_rtt sample, rather than <=. The Netflix transport/TCP team discovered this bug in the Linux TCP BBR code during lab tests. Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control") Signed-off-by: Ryan Sharpelletti <[email protected]> Signed-off-by: Neal Cardwell <[email protected]> Signed-off-by: Soheil Hassas Yeganeh <[email protected]> Signed-off-by: Yuchung Cheng <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2020-11-17net: ftgmac100: Fix crash when removing driverJoel Stanley1-0/+4
When removing the driver we would hit BUG_ON(!list_empty(&dev->ptype_specific)) in net/core/dev.c due to still having the NC-SI packet handler registered. # echo 1e660000.ethernet > /sys/bus/platform/drivers/ftgmac100/unbind ------------[ cut here ]------------ kernel BUG at net/core/dev.c:10254! Internal error: Oops - BUG: 0 [#1] SMP ARM CPU: 0 PID: 115 Comm: sh Not tainted 5.10.0-rc3-next-20201111-00007-g02e0365710c4 #46 Hardware name: Generic DT based system PC is at netdev_run_todo+0x314/0x394 LR is at cpumask_next+0x20/0x24 pc : [<806f5830>] lr : [<80863cb0>] psr: 80000153 sp : 855bbd58 ip : 00000001 fp : 855bbdac r10: 80c03d00 r9 : 80c06228 r8 : 81158c54 r7 : 00000000 r6 : 80c05dec r5 : 80c05d18 r4 : 813b9280 r3 : 813b9054 r2 : 8122c470 r1 : 00000002 r0 : 00000002 Flags: Nzcv IRQs on FIQs off Mode SVC_32 ISA ARM Segment none Control: 00c5387d Table: 85514008 DAC: 00000051 Process sh (pid: 115, stack limit = 0x7cb5703d) ... Backtrace: [<806f551c>] (netdev_run_todo) from [<80707eec>] (rtnl_unlock+0x18/0x1c) r10:00000051 r9:854ed710 r8:81158c54 r7:80c76bb0 r6:81158c10 r5:8115b410 r4:813b9000 [<80707ed4>] (rtnl_unlock) from [<806f5db8>] (unregister_netdev+0x2c/0x30) [<806f5d8c>] (unregister_netdev) from [<805a8180>] (ftgmac100_remove+0x20/0xa8) r5:8115b410 r4:813b9000 [<805a8160>] (ftgmac100_remove) from [<805355e4>] (platform_drv_remove+0x34/0x4c) Fixes: bd466c3fb5a4 ("net/faraday: Support NCSI mode") Signed-off-by: Joel Stanley <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2020-11-17KVM: arm64: vgic-v3: Drop the reporting of GICR_TYPER.Last for userspaceZenghui Yu1-2/+20
It was recently reported that if GICR_TYPER is accessed before the RD base address is set, we'll suffer from the unset @rdreg dereferencing. Oops... gpa_t last_rdist_typer = rdreg->base + GICR_TYPER + (rdreg->free_index - 1) * KVM_VGIC_V3_REDIST_SIZE; It's "expected" that users will access registers in the redistributor if the RD has been properly configured (e.g., the RD base address is set). But it hasn't yet been covered by the existing documentation. Per discussion on the list [1], the reporting of the GICR_TYPER.Last bit for userspace never actually worked. And it's difficult for us to emulate it correctly given that userspace has the flexibility to access it any time. Let's just drop the reporting of the Last bit for userspace for now (userspace should have full knowledge about it anyway) and it at least prevents kernel from panic ;-) [1] https://lore.kernel.org/kvmarm/[email protected]/ Fixes: ba7b3f1275fd ("KVM: arm/arm64: Revisit Redistributor TYPER last bit computation") Reported-by: Keqian Zhu <[email protected]> Signed-off-by: Zenghui Yu <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Reviewed-by: Eric Auger <[email protected]> Link: https://lore.kernel.org/r/[email protected] Cc: [email protected]
2020-11-17net: b44: fix error return code in b44_init_one()Zhang Changzhong1-1/+2
Fix to return a negative error code from the error handling case instead of 0, as done elsewhere in this function. Fixes: 39a6f4bce6b4 ("b44: replace the ssb_dma API with the generic DMA API") Reported-by: Hulk Robot <[email protected]> Signed-off-by: Zhang Changzhong <[email protected]> Reviewed-by: Michael Chan <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2020-11-17Merge tag 'perf-tools-fixes-for-v5.10-2020-11-17' of ↵Linus Torvalds9-25/+34
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux Pull perf tools fixes from Arnaldo Carvalho de Melo: - Fix file corruption due to event deletion in 'perf inject'. - Update arch/x86/lib/mem{cpy,set}_64.S copies used in 'perf bench mem memcpy', silencing perf build warning. - Avoid an msan warning in a copied stack in 'perf test'. - Correct tracepoint field name "flags" in ARM's CS-ETM hardware tracing 'perf test' entry. - Update branch sample pattern for cs-etm to cope with excluding guest in userspace counting. - Don't free "lock_seq_stat" if read_count isn't zero in 'perf lock'. * tag 'perf-tools-fixes-for-v5.10-2020-11-17' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: perf test: Avoid an msan warning in a copied stack. perf inject: Fix file corruption due to event deletion perf test: Update branch sample pattern for cs-etm perf test: Fix a typo in cs-etm testing tools arch: Update arch/x86/lib/mem{cpy,set}_64.S copies used in 'perf bench mem memcpy' perf lock: Don't free "lock_seq_stat" if read_count isn't zero perf lock: Correct field name "flags"
2020-11-17qed: fix error return code in qed_iwarp_ll2_start()Zhang Changzhong1-3/+9
Fix to return a negative error code from the error handling case instead of 0, as done elsewhere in this function. Fixes: 469981b17a4f ("qed: Add unaligned and packed packet processing") Fixes: fcb39f6c10b2 ("qed: Add mpa buffer descriptors for storing and processing mpa fpdus") Fixes: 1e28eaad07ea ("qed: Add iWARP support for fpdu spanned over more than two tcp packets") Reported-by: Hulk Robot <[email protected]> Signed-off-by: Zhang Changzhong <[email protected]> Acked-by: Michal Kalderon <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2020-11-17Merge branch 'urgent-fixes' of ↵Linus Torvalds1-5/+17
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU fix from Paul McKenney: "A single commit that fixes a bug that was introduced a couple of merge windows ago, but which rather more recently converged to an agreed-upon fix. The bug is that interrupts can be incorrectly enabled while holding an irq-disabled spinlock. This can of course result in self-deadlocks. The bug is a bit difficult to trigger. It requires that a preempted task be blocking a preemptible-RCU grace period long enough to trigger an RCU CPU stall warning. In addition, an interrupt must occur at just the right time, and that interrupt's handler must acquire that same irq-disabled spinlock. Still, a deadlock is a deadlock. Furthermore, we do now have a fix, and that fix survives kernel test robot, -next, and rcutorture testing. It has also been verified by Sebastian as fixing the bug. Therefore..." * 'urgent-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled
2020-11-17drm/sun4i: dw-hdmi: fix error return code in sun8i_dw_hdmi_bind()Xiongfeng Wang1-0/+1
Fix to return a negative error code from the error handling case instead of 0 in function sun8i_dw_hdmi_bind(). Fixes: b7c7436a5ff0 ("drm/sun4i: Implement A83T HDMI driver") Reported-by: Hulk Robot <[email protected]> Signed-off-by: Xiongfeng Wang <[email protected]> Reviewed-by: Jernej Skrabec <[email protected]> Signed-off-by: Jernej Skrabec <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2020-11-17spi: npcm-fiu: Don't leak SPI master in probe error pathLukas Wunner1-1/+1
If the calls to of_match_device(), of_alias_get_id(), devm_ioremap_resource(), devm_regmap_init_mmio() or devm_clk_get() fail on probe of the NPCM FIU SPI driver, the spi_controller struct is erroneously not freed. Fix by switching over to the new devm_spi_alloc_master() helper. Fixes: ace55c411b11 ("spi: npcm-fiu: add NPCM FIU controller driver") Signed-off-by: Lukas Wunner <[email protected]> Cc: <[email protected]> # v5.4+: 5e844cc37a5c: spi: Introduce device-managed SPI controller allocation Cc: <[email protected]> # v5.4+ Cc: Tomer Maimon <[email protected]> Link: https://lore.kernel.org/r/a420c23a363a3bc9aa684c6e790c32a8af106d17.1605512876.git.lukas@wunner.de Signed-off-by: Mark Brown <[email protected]>
2020-11-17spi: dw: Set transfer handler before unmasking the IRQsSerge Semin1-2/+2
It turns out the IRQs most like can be unmasked before the controller is enabled with no problematic consequences. The manual doesn't explicitly state that, but the examples perform the controller initialization procedure in that order. So the commit da8f58909e7e ("spi: dw: Unmask IRQs after enabling the chip") hasn't been that required as I thought. But anyway setting the IRQs up after the chip enabling still worth adding since it has simplified the code a bit. The problem is that it has introduced a potential bug. The transfer handler pointer is now initialized after the IRQs are enabled. That may and eventually will cause an invalid or uninitialized callback invocation. Fix that just by performing the callback initialization before the IRQ unmask procedure. Fixes: da8f58909e7e ("spi: dw: Unmask IRQs after enabling the chip") Signed-off-by: Serge Semin <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Mark Brown <[email protected]>
2020-11-17arm64: dts: qcom: clear the warnings caused by empty dma-rangesZhen Lei1-36/+36
The scripts/dtc/checks.c requires that the node have empty "dma-ranges" property must have the same "#address-cells" and "#size-cells" values as the parent node. Otherwise, the following warnings is reported: arch/arm64/boot/dts/qcom/ipq6018.dtsi:185.3-14: Warning \ (dma_ranges_format): /soc:dma-ranges: empty "dma-ranges" property but \ its #address-cells (1) differs from / (2) arch/arm64/boot/dts/qcom/ipq6018.dtsi:185.3-14: Warning \ (dma_ranges_format): /soc:dma-ranges: empty "dma-ranges" property but \ its #size-cells (1) differs from / (2) Arnd Bergmann figured out why it's necessary: Also note that the #address-cells=<1> means that any device under this bus is assumed to only support 32-bit addressing, and DMA will have to go through a slow swiotlb in the absence of an IOMMU. Suggested-by: Arnd Bergmann <[email protected]> Signed-off-by: Zhen Lei <[email protected]> Reviewed-by: Bjorn Andersson <[email protected]> Link: https://lore.kernel.org/r/[email protected]' Signed-off-by: Arnd Bergmann <[email protected]>
2020-11-17arm64: dts: broadcom: clear the warnings caused by empty dma-rangesZhen Lei1-10/+10
The scripts/dtc/checks.c requires that the node have empty "dma-ranges" property must have the same "#address-cells" and "#size-cells" values as the parent node. Otherwise, the following warnings is reported: arch/arm64/boot/dts/broadcom/stingray/stingray-usb.dtsi:7.3-14: Warning \ (dma_ranges_format): /usb:dma-ranges: empty "dma-ranges" property but \ its #address-cells (1) differs from / (2) arch/arm64/boot/dts/broadcom/stingray/stingray-usb.dtsi:7.3-14: Warning \ (dma_ranges_format): /usb:dma-ranges: empty "dma-ranges" property but \ its #size-cells (1) differs from / (2) Arnd Bergmann figured out why it's necessary: Also note that the #address-cells=<1> means that any device under this bus is assumed to only support 32-bit addressing, and DMA will have to go through a slow swiotlb in the absence of an IOMMU. Suggested-by: Arnd Bergmann <[email protected]> Signed-off-by: Zhen Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected]' Acked-by: Florian Fainelli <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]>
2020-11-17xtensa: uaccess: Add missing __user to strncpy_from_user() prototypeLaurent Pinchart1-1/+1
When adding __user annotations in commit 2adf5352a34a, the strncpy_from_user() function declaration for the CONFIG_GENERIC_STRNCPY_FROM_USER case was missed. Fix it. Reported-by: kernel test robot <[email protected]> Signed-off-by: Laurent Pinchart <[email protected]> Message-Id: <[email protected]> Signed-off-by: Max Filippov <[email protected]>
2020-11-17ALSA: usb-audio: Add delay quirk for all Logitech USB devicesJoakim Tjernlund1-5/+5
Found one more Logitech device, BCC950 ConferenceCam, which needs the same delay here. This makes 3 out of 3 devices I have tried. Therefore, add a delay for all Logitech devices as it does not hurt. Signed-off-by: Joakim Tjernlund <[email protected]> Cc: <[email protected]> # 4.19.y, 5.4.y Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Takashi Iwai <[email protected]>
2020-11-17Merge branch 'cpufreq/arm/fixes' of ↵Rafael J. Wysocki2-12/+27
git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm Pull cpufreq-arm fixes for 5.10-rc5 from Viresh Kumar: "- tegra186: Fix ->get() callback. - arm/scmi: Add dummy clock provider to fix failure." * 'cpufreq/arm/fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm: cpufreq: scmi: Fix OPP addition failure with a dummy clock provider cpufreq: tegra186: Fix get frequency callback
2020-11-17perf/x86: fix sysfs type mismatchesSami Tolvanen4-24/+12
This change switches rapl to use PMU_FORMAT_ATTR, and fixes two other macros to use device_attribute instead of kobj_attribute to avoid callback type mismatches that trip indirect call checking with Clang's Control-Flow Integrity (CFI). Reported-by: Sedat Dilek <[email protected]> Signed-off-by: Sami Tolvanen <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Kees Cook <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-11-17lockdep: Put graph lock/unlock under lock_recursion protectionBoqun Feng1-2/+4
A warning was hit when running xfstests/generic/068 in a Hyper-V guest: [...] ------------[ cut here ]------------ [...] DEBUG_LOCKS_WARN_ON(lockdep_hardirqs_enabled()) [...] WARNING: CPU: 2 PID: 1350 at kernel/locking/lockdep.c:5280 check_flags.part.0+0x165/0x170 [...] ... [...] Workqueue: events pwq_unbound_release_workfn [...] RIP: 0010:check_flags.part.0+0x165/0x170 [...] ... [...] Call Trace: [...] lock_is_held_type+0x72/0x150 [...] ? lock_acquire+0x16e/0x4a0 [...] rcu_read_lock_sched_held+0x3f/0x80 [...] __send_ipi_one+0x14d/0x1b0 [...] hv_send_ipi+0x12/0x30 [...] __pv_queued_spin_unlock_slowpath+0xd1/0x110 [...] __raw_callee_save___pv_queued_spin_unlock_slowpath+0x11/0x20 [...] .slowpath+0x9/0xe [...] lockdep_unregister_key+0x128/0x180 [...] pwq_unbound_release_workfn+0xbb/0xf0 [...] process_one_work+0x227/0x5c0 [...] worker_thread+0x55/0x3c0 [...] ? process_one_work+0x5c0/0x5c0 [...] kthread+0x153/0x170 [...] ? __kthread_bind_mask+0x60/0x60 [...] ret_from_fork+0x1f/0x30 The cause of the problem is we have call chain lockdep_unregister_key() -> <irq disabled by raw_local_irq_save()> lockdep_unlock() -> arch_spin_unlock() -> __pv_queued_spin_unlock_slowpath() -> pv_kick() -> __send_ipi_one() -> trace_hyperv_send_ipi_one(). Although this particular warning is triggered because Hyper-V has a trace point in ipi sending, but in general arch_spin_unlock() may call another function having a trace point in it, so put the arch_spin_lock() and arch_spin_unlock() after lock_recursion protection to fix this problem and avoid similiar problems. Signed-off-by: Boqun Feng <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-11-17sched/deadline: Fix priority inheritance with multiple scheduling classesJuri Lelli3-50/+68
Glenn reported that "an application [he developed produces] a BUG in deadline.c when a SCHED_DEADLINE task contends with CFS tasks on nested PTHREAD_PRIO_INHERIT mutexes. I believe the bug is triggered when a CFS task that was boosted by a SCHED_DEADLINE task boosts another CFS task (nested priority inheritance). ------------[ cut here ]------------ kernel BUG at kernel/sched/deadline.c:1462! invalid opcode: 0000 [#1] PREEMPT SMP CPU: 12 PID: 19171 Comm: dl_boost_bug Tainted: ... Hardware name: ... RIP: 0010:enqueue_task_dl+0x335/0x910 Code: ... RSP: 0018:ffffc9000c2bbc68 EFLAGS: 00010002 RAX: 0000000000000009 RBX: ffff888c0af94c00 RCX: ffffffff81e12500 RDX: 000000000000002e RSI: ffff888c0af94c00 RDI: ffff888c10b22600 RBP: ffffc9000c2bbd08 R08: 0000000000000009 R09: 0000000000000078 R10: ffffffff81e12440 R11: ffffffff81e1236c R12: ffff888bc8932600 R13: ffff888c0af94eb8 R14: ffff888c10b22600 R15: ffff888bc8932600 FS: 00007fa58ac55700(0000) GS:ffff888c10b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fa58b523230 CR3: 0000000bf44ab003 CR4: 00000000007606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: ? intel_pstate_update_util_hwp+0x13/0x170 rt_mutex_setprio+0x1cc/0x4b0 task_blocks_on_rt_mutex+0x225/0x260 rt_spin_lock_slowlock_locked+0xab/0x2d0 rt_spin_lock_slowlock+0x50/0x80 hrtimer_grab_expiry_lock+0x20/0x30 hrtimer_cancel+0x13/0x30 do_nanosleep+0xa0/0x150 hrtimer_nanosleep+0xe1/0x230 ? __hrtimer_init_sleeper+0x60/0x60 __x64_sys_nanosleep+0x8d/0xa0 do_syscall_64+0x4a/0x100 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fa58b52330d ... ---[ end trace 0000000000000002 ]— He also provided a simple reproducer creating the situation below: So the execution order of locking steps are the following (N1 and N2 are non-deadline tasks. D1 is a deadline task. M1 and M2 are mutexes that are enabled * with priority inheritance.) Time moves forward as this timeline goes down: N1 N2 D1 | | | | | | Lock(M1) | | | | | | Lock(M2) | | | | | | Lock(M2) | | | | Lock(M1) | | (!!bug triggered!) | Daniel reported a similar situation as well, by just letting ksoftirqd run with DEADLINE (and eventually block on a mutex). Problem is that boosted entities (Priority Inheritance) use static DEADLINE parameters of the top priority waiter. However, there might be cases where top waiter could be a non-DEADLINE entity that is currently boosted by a DEADLINE entity from a different lock chain (i.e., nested priority chains involving entities of non-DEADLINE classes). In this case, top waiter static DEADLINE parameters could be null (initialized to 0 at fork()) and replenish_dl_entity() would hit a BUG(). Fix this by keeping track of the original donor and using its parameters when a task is boosted. Reported-by: Glenn Elliott <[email protected]> Reported-by: Daniel Bristot de Oliveira <[email protected]> Signed-off-by: Juri Lelli <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Daniel Bristot de Oliveira <[email protected]> Link: https://lkml.kernel.org/r/[email protected]