aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2021-09-21nvme-tcp: fix incorrect h2cdata pdu offset accountingSagi Grimberg1-3/+10
When the controller sends us multiple r2t PDUs in a single request we need to account for it correctly as our send/recv context run concurrently (i.e. we get a new r2t with r2t_offset before we updated our iterator and req->data_sent marker). This can cause wrong offsets to be sent to the controller. To fix that, we will first know that this may happen only in the send sequence of the last page, hence we will take the r2t_offset to the h2c PDU data_offset, and in nvme_tcp_try_send_data loop, we make sure to increment the request markers also when we completed a PDU but we are expecting more r2t PDUs as we still did not send the entire data of the request. Fixes: 825619b09ad3 ("nvme-tcp: fix possible use-after-completion") Reported-by: Nowak, Lukasz <[email protected]> Tested-by: Nowak, Lukasz <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Reviewed-by: Keith Busch <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2021-09-21nvme-fc: remove freeze/unfreeze around update_nr_hw_queuesJames Smart1-2/+0
Remove the freeze/unfreeze around changes to the number of hardware queues. Study and retest has indicated there are no ios that can be active at this point so there is nothing to freeze. nvme-fc is draining the queues in the shutdown and error recovery path in __nvme_fc_abort_outstanding_ios. This patch primarily reverts 88e837ed0f1f "nvme-fc: wait for queues to freeze before calling update_hr_hw_queues". It's not an exact revert as it leaves the adjusting of hw queues only if the count changes. Signed-off-by: James Smart <[email protected]> [dwagner: added explanation why no IO is pending] Signed-off-by: Daniel Wagner <[email protected]> Reviewed-by: Ming Lei <[email protected]> Reviewed-by: Himanshu Madhani <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2021-09-21nvme-fc: avoid race between time out and tear downJames Smart1-0/+2
To avoid race between time out and tear down, in tear down process, first we quiesce the queue, and then delete the timer and cancel the time out work for the queue. This patch merges the admin and io sync ops into the queue teardown logic as shown in the RDMA patch 3017013dcc "nvme-rdma: avoid race between time out and tear down". There is no teardown_lock in nvme-fc. Signed-off-by: James Smart <[email protected]> Tested-by: Daniel Wagner <[email protected]> Reviewed-by: Himanshu Madhani <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Daniel Wagner <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2021-09-21nvme-fc: update hardware queues before using themDaniel Wagner1-8/+8
In case the number of hardware queues changes, we need to update the tagset and the mapping of ctx to hctx first. If we try to create and connect the I/O queues first, this operation will fail (target will reject the connect call due to the wrong number of queues) and hence we bail out of the recreate function. Then we will to try the very same operation again, thus we don't make any progress. Signed-off-by: Daniel Wagner <[email protected]> Reviewed-by: Ming Lei <[email protected]> Reviewed-by: Himanshu Madhani <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: James Smart <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2021-09-21debugfs: debugfs_create_file_size(): use IS_ERR to check for errorNirmoy Das1-1/+1
debugfs_create_file() returns encoded error so use IS_ERR for checking return value. Reviewed-by: Christian König <[email protected]> Signed-off-by: Nirmoy Das <[email protected]> Fixes: ff9fb72bc077 ("debugfs: return error values, not NULL") Cc: stable <[email protected]> References: https://gitlab.freedesktop.org/drm/amd/-/issues/1686 Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2021-09-21netfilter: conntrack: serialize hash resizes and cleanupsEric Dumazet1-33/+37
Syzbot was able to trigger the following warning [1] No repro found by syzbot yet but I was able to trigger similar issue by having 2 scripts running in parallel, changing conntrack hash sizes, and: for j in `seq 1 1000` ; do unshare -n /bin/true >/dev/null ; done It would take more than 5 minutes for net_namespace structures to be cleaned up. This is because nf_ct_iterate_cleanup() has to restart everytime a resize happened. By adding a mutex, we can serialize hash resizes and cleanups and also make get_next_corpse() faster by skipping over empty buckets. Even without resizes in the picture, this patch considerably speeds up network namespace dismantles. [1] INFO: task syz-executor.0:8312 can't die for more than 144 seconds. task:syz-executor.0 state:R running task stack:25672 pid: 8312 ppid: 6573 flags:0x00004006 Call Trace: context_switch kernel/sched/core.c:4955 [inline] __schedule+0x940/0x26f0 kernel/sched/core.c:6236 preempt_schedule_common+0x45/0xc0 kernel/sched/core.c:6408 preempt_schedule_thunk+0x16/0x18 arch/x86/entry/thunk_64.S:35 __local_bh_enable_ip+0x109/0x120 kernel/softirq.c:390 local_bh_enable include/linux/bottom_half.h:32 [inline] get_next_corpse net/netfilter/nf_conntrack_core.c:2252 [inline] nf_ct_iterate_cleanup+0x15a/0x450 net/netfilter/nf_conntrack_core.c:2275 nf_conntrack_cleanup_net_list+0x14c/0x4f0 net/netfilter/nf_conntrack_core.c:2469 ops_exit_list+0x10d/0x160 net/core/net_namespace.c:171 setup_net+0x639/0xa30 net/core/net_namespace.c:349 copy_net_ns+0x319/0x760 net/core/net_namespace.c:470 create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110 unshare_nsproxy_namespaces+0xc1/0x1f0 kernel/nsproxy.c:226 ksys_unshare+0x445/0x920 kernel/fork.c:3128 __do_sys_unshare kernel/fork.c:3202 [inline] __se_sys_unshare kernel/fork.c:3200 [inline] __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3200 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f63da68e739 RSP: 002b:00007f63d7c05188 EFLAGS: 00000246 ORIG_RAX: 0000000000000110 RAX: ffffffffffffffda RBX: 00007f63da792f80 RCX: 00007f63da68e739 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000040000000 RBP: 00007f63da6e8cc4 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f63da792f80 R13: 00007fff50b75d3f R14: 00007f63d7c05300 R15: 0000000000022000 Showing all locks held in the system: 1 lock held by khungtaskd/27: #0: ffffffff8b980020 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x53/0x260 kernel/locking/lockdep.c:6446 2 locks held by kworker/u4:2/153: #0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline] #0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: arch_atomic_long_set include/linux/atomic/atomic-long.h:41 [inline] #0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: atomic_long_set include/linux/atomic/atomic-instrumented.h:1198 [inline] #0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: set_work_data kernel/workqueue.c:634 [inline] #0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: set_work_pool_and_clear_pending kernel/workqueue.c:661 [inline] #0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x896/0x1690 kernel/workqueue.c:2268 #1: ffffc9000140fdb0 ((kfence_timer).work){+.+.}-{0:0}, at: process_one_work+0x8ca/0x1690 kernel/workqueue.c:2272 1 lock held by systemd-udevd/2970: 1 lock held by in:imklog/6258: #0: ffff88807f970ff0 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0xe9/0x100 fs/file.c:990 3 locks held by kworker/1:6/8158: 1 lock held by syz-executor.0/8312: 2 locks held by kworker/u4:13/9320: 1 lock held by syz-executor.5/10178: 1 lock held by syz-executor.4/10217: Signed-off-by: Eric Dumazet <[email protected]> Reported-by: syzbot <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2021-09-21netfilter: log: work around missing softdep backend moduleFlorian Westphal3-3/+34
iptables/nftables has two types of log modules: 1. backend, e.g. nf_log_syslog, which implement the functionality 2. frontend, e.g. xt_LOG or nft_log, which call the functionality provided by backend based on nf_tables or xtables rule set. Problem is that the request_module() call to load the backed in nf_logger_find_get() might happen with nftables transaction mutex held in case the call path is via nf_tables/nft_compat. This can cause deadlocks (see 'Fixes' tags for details). The chosen solution as to let modprobe deal with this by adding 'pre: ' soft dep tag to xt_LOG (to load the syslog backend) and xt_NFLOG (to load nflog backend). Eric reports that this breaks on systems with older modprobe that doesn't support softdeps. Another, similar issue occurs when someone either insmods xt_(NF)LOG directly or unloads the backend module (possible if no log frontend is in use): because the frontend module is already loaded, modprobe is not invoked again so the softdep isn't evaluated. Add a workaround: If nf_logger_find_get() returns -ENOENT and call is not via nft_compat, load the backend explicitly and try again. Else, let nft_compat ask for deferred request_module via nf_tables infra. Softdeps are kept in-place, so with newer modprobe the dependencies are resolved from userspace. Fixes: cefa31a9d461 ("netfilter: nft_log: perform module load from nf_tables") Fixes: a38b5b56d6f4 ("netfilter: nf_log: add module softdeps") Reported-and-tested-by: Eric Dumazet <[email protected]> Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2021-09-21netfilter: iptable_raw: drop bogus net_init annotationFlorian Westphal1-1/+1
This is a leftover from the times when this function was wired up via pernet_operations. Now its called when userspace asks for the table. With CONFIG_NET_NS=n, iptable_raw_table_init memory has been discarded already and we get a kernel crash. Other tables are fine, __net_init annotation was removed already. Fixes: fdacd57c79b7 ("netfilter: x_tables: never register tables by default") Reported-by: youling 257 <[email protected]> Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2021-09-21netfilter: nf_nat_masquerade: defer conntrack walk to work queueFlorian Westphal1-26/+24
The ipv4 and device notifiers are called with RTNL mutex held. The table walk can take some time, better not block other RTNL users. 'ip a' has been reported to block for up to 20 seconds when conntrack table has many entries and device down events are frequent (e.g., PPP). Reported-and-tested-by: Martin Zaharinov <[email protected]> Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2021-09-21netfilter: nf_nat_masquerade: make async masq_inet6_event handling genericFlorian Westphal1-47/+75
masq_inet6_event is called asynchronously from system work queue, because the inet6 notifier is atomic and nf_iterate_cleanup can sleep. The ipv4 and device notifiers call nf_iterate_cleanup directly. This is legal, but these notifiers are called with RTNL mutex held. A large conntrack table with many devices coming and going will have severe impact on the system usability, with 'ip a' blocking for several seconds. This change places the defer code into a helper and makes it more generic so ipv4 and ifdown notifiers can be converted to defer the cleanup walk as well in a follow patch. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2021-09-21netfilter: nf_tables: Fix oversized kvmalloc() callsPablo Neira Ayuso1-1/+1
The commit 7661809d493b ("mm: don't allow oversized kvmalloc() calls") limits the max allocatable memory via kvmalloc() to MAX_INT. Reported-by: [email protected] Signed-off-by: Pablo Neira Ayuso <[email protected]>
2021-09-21netfilter: nf_tables: unlink table before deleting itFlorian Westphal1-10/+18
syzbot reports following UAF: BUG: KASAN: use-after-free in memcmp+0x18f/0x1c0 lib/string.c:955 nla_strcmp+0xf2/0x130 lib/nlattr.c:836 nft_table_lookup.part.0+0x1a2/0x460 net/netfilter/nf_tables_api.c:570 nft_table_lookup net/netfilter/nf_tables_api.c:4064 [inline] nf_tables_getset+0x1b3/0x860 net/netfilter/nf_tables_api.c:4064 nfnetlink_rcv_msg+0x659/0x13f0 net/netfilter/nfnetlink.c:285 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504 Problem is that all get operations are lockless, so the commit_mutex held by nft_rcv_nl_event() isn't enough to stop a parallel GET request from doing read-accesses to the table object even after synchronize_rcu(). To avoid this, unlink the table first and store the table objects in on-stack scratch space. Fixes: 6001a930ce03 ("netfilter: nftables: introduce table ownership") Reported-and-tested-by: [email protected] Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2021-09-21selftests: netfilter: add zone stress test with colliding tuplesFlorian Westphal1-0/+156
Add 20k entries to the connection tracking table, once from the data plane, once via ctnetlink. In both cases, each entry lives in a different conntrack zone and addresses/ports are identical. Expectation is that insertions work and occurs in constant time: PASS: added 10000 entries in 1215 ms (now 10000 total, loop 1) PASS: added 10000 entries in 1214 ms (now 20000 total, loop 2) PASS: inserted 20000 entries from packet path in 2434 ms total PASS: added 10000 entries in 57631 ms (now 10000 total) PASS: added 10000 entries in 58572 ms (now 20000 total) PASS: inserted 20000 entries via ctnetlink in 116205 ms Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2021-09-21selftests: netfilter: add selftest for directional zone supportFlorian Westphal1-0/+309
Add a script to exercise NAT port clash resolution with directional zones. Add net namespaces that use the same IP address and connect them to a gateway. Gateway uses policy routing based on iif/mark and conntrack zones to isolate the client namespaces. In server direction, same zone with NAT to single address is used. Then, connect to a server from each client netns, using identical connection id, i.e. saddr:sport -> daddr:dport. Expectation is for all connections to succeeed: NAT gatway is supposed to do port reallocation for each of the (clashing) connections. This is based on the description/use case provided in the commit message of deedb59039f111 ("netfilter: nf_conntrack: add direction support for zones"). Cc: Daniel Borkmann <[email protected]> Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2021-09-21netfilter: nat: include zone id in nat table hash againFlorian Westphal1-5/+12
Similar to the conntrack change, also use the zone id for the nat source lists if the zone id is valid in both directions. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2021-09-21netfilter: conntrack: include zone id in tuple hash againFlorian Westphal1-15/+52
commit deedb59039f111 ("netfilter: nf_conntrack: add direction support for zones") removed the zone id from the hash value. This has implications on hash chain lengths with overlapping tuples, which can hit 64k entries on released kernels, before upper droplimit was added in d7e7747ac5c ("netfilter: refuse insertion if chain has grown too large"). With that change reverted, test script coming with this series shows linear insertion time growth: 10000 entries in 3737 ms (now 10000 total, loop 1) 10000 entries in 16994 ms (now 20000 total, loop 2) 10000 entries in 47787 ms (now 30000 total, loop 3) 10000 entries in 72731 ms (now 40000 total, loop 4) 10000 entries in 95761 ms (now 50000 total, loop 5) 10000 entries in 96809 ms (now 60000 total, loop 6) inserted 60000 entries from packet path in 333825 ms With d7e7747ac5c in place, the test fails. There are three supported zone use cases: 1. Connection is in the default zone (zone 0). This means to special config (the default). 2. Connection is in a different zone (1 to 2**16). This means rules are in place to put packets in the desired zone, e.g. derived from vlan id or interface. 3. Original direction is in zone X and Reply is in zone 0. 3) allows to use of the existing NAT port collision avoidance to provide connectivity to internet/wan even when the various zones have overlapping source networks separated via policy routing. In case the original zone is 0 all three cases are identical. There is no way to place original direction in zone x and reply in zone y (with y != 0). Zones need to be assigned manually via the iptables/nftables ruleset, before conntrack lookup occurs (raw table in iptables) using the "CT" target conntrack template support (-j CT --{zone,zone-orig,zone-reply} X). Normally zone assignment happens based on incoming interface, but could also be derived from packet mark, vlan id and so on. This means that when case 3 is used, the ruleset will typically not even assign a connection tracking template to the "reply" packets, so lookup happens in zone 0. However, it is possible that reply packets also match a ct zone assignment rule which sets up a template for zone X (X > 0) in original direction only. Therefore, after making the zone id part of the hash, we need to do a second lookup using the reply zone id if we did not find an entry on the first lookup. In practice, most deployments will either not use zones at all or the origin and reply zones are the same, no second lookup is required in either case. After this change, packet path insertion test passes with constant insertion times: 10000 entries in 1064 ms (now 10000 total, loop 1) 10000 entries in 1074 ms (now 20000 total, loop 2) 10000 entries in 1066 ms (now 30000 total, loop 3) 10000 entries in 1079 ms (now 40000 total, loop 4) 10000 entries in 1081 ms (now 50000 total, loop 5) 10000 entries in 1082 ms (now 60000 total, loop 6) inserted 60000 entries from packet path in 6452 ms Cc: Daniel Borkmann <[email protected]> Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2021-09-21netfilter: conntrack: make max chain length randomFlorian Westphal1-6/+11
Similar to commit 67d6d681e15b ("ipv4: make exception cache less predictible"): Use a random drop length to make it harder to detect when entries were hashed to same bucket list. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2021-09-20Merge tag 'afs-fixes-20210913' of ↵Linus Torvalds17-120/+294
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull AFS fixes from David Howells: "Fixes for AFS problems that can cause data corruption due to interaction with another client modifying data cached locally: - When d_revalidating a dentry, don't look at the inode to which it points. Only check the directory to which the dentry belongs. This was confusing things and causing the silly-rename cleanup code to remove the file now at the dentry of a file that got deleted. - Fix mmap data coherency. When a callback break is received that relates to a file that we have cached, the data content may have been changed (there are other reasons, such as the user's rights having been changed). However, we're checking it lazily, only on entry to the kernel, which doesn't happen if we have a writeable shared mapped page on that file. We make the kernel keep track of mmapped files and clear all PTEs mapping to that file as soon as the callback comes in by calling unmap_mapping_pages() (we don't necessarily want to zap the pagecache). This causes the kernel to be reentered when userspace tries to access the mmapped address range again - and at that point we can query the server and, if we need to, zap the page cache. Ideally, I would check each file at the point of notification, but that involves poking the server[*] - which is holding an exclusive lock on the vnode it is changing, waiting for all the clients it notified to reply. This could then deadlock against the server. Further, invalidating the pagecache might call ->launder_page(), which would try to write to the file, which would definitely deadlock. (AFS doesn't lease file access). [*] Checking to see if the file content has changed is a matter of comparing the current data version number, but we have to ask the server for that. We also need to get a new callback promise and we need to poke the server for that too. - Add some more points at which the inode is validated, since we're doing it lazily, notably in ->read_iter() and ->page_mkwrite(), but also when performing some directory operations. Ideally, checking in ->read_iter() would be done in some derivation of filemap_read(). If we're going to call the server to read the file, then we get the file status fetch as part of that. - The above is now causing us to make a lot more calls to afs_validate() to check the inode - and afs_validate() takes the RCU read lock each time to make a quick check (ie. afs_check_validity()). This is entirely for the purpose of checking cb_s_break to see if the server we're using reinitialised its list of callbacks - however this isn't a very common event, so most of the time we're taking this needlessly. Add a new cell-wide counter to count the number of reinitialisations done by any server and check that - and only if that changes, take the RCU read lock and check the server list (the server list may change, but the cell a file is part of won't). - Don't update vnode->cb_s_break and ->cb_v_break inside the validity checking loop. The cb_lock is done with read_seqretry, so we might go round the loop a second time after resetting those values - and that could cause someone else checking validity to miss something (I think). Also included are patches for fixes for some bugs encountered whilst debugging this: - Fix a leak of afs_read objects and fix a leak of keys hidden by that. - Fix a leak of pages that couldn't be added to extend a writeback. - Fix the maintenance of i_blocks when i_size is changed by a local write or a local dir edit" Link: https://bugzilla.kernel.org/show_bug.cgi?id=214217 [1] Link: https://lore.kernel.org/r/163111665183.283156.17200205573146438918.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/163113612442.352844.11162345591911691150.stgit@warthog.procyon.org.uk/ # i_blocks patch * tag 'afs-fixes-20210913' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: afs: Fix updating of i_blocks on file/dir extension afs: Fix corruption in reads at fpos 2G-4G from an OpenAFS server afs: Try to avoid taking RCU read lock when checking vnode validity afs: Fix mmap coherency vs 3rd-party changes afs: Fix incorrect triggering of sillyrename on 3rd-party invalidation afs: Add missing vnode validation checks afs: Fix page leak afs: Fix missing put on afs_read objects and missing get on the key therein
2021-09-20Merge tag '5.15-rc1-ksmbd' of git://git.samba.org/ksmbdLinus Torvalds4-17/+81
Pull ksmbd server fixes from Steve French: "Three ksmbd fixes, including an important security fix for path processing, and a buffer overflow check, and a trivial fix for incorrect header inclusion" * tag '5.15-rc1-ksmbd' of git://git.samba.org/ksmbd: ksmbd: add validation for FILE_FULL_EA_INFORMATION of smb2_get_info ksmbd: prevent out of share access ksmbd: transport_rdma: Don't include rwlock.h directly
2021-09-20Merge tag '5.15-rc1-smb3' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds48-57/+67
Pull cifs client fixes from Steve French: - two deferred close fixes (for bugs found with xfstests 478 and 461) - a deferred close improvement in rename - two trivial fixes for incorrect Linux comment formatting of multiple cifs files (pointed out by automated kernel test robot and checkpatch) * tag '5.15-rc1-smb3' of git://git.samba.org/sfrench/cifs-2.6: cifs: Not to defer close on file when lock is set cifs: Fix soft lockup during fsstress cifs: Deferred close performance improvements cifs: fix incorrect kernel doc comments cifs: remove pathname for file from SPDX header
2021-09-20x86/fault: Fix wrong signal when vsyscall fails with pkeyJiashuo Liang3-10/+20
The function __bad_area_nosemaphore() calls kernelmode_fixup_or_oops() with the parameter @signal being actually @pkey, which will send a signal numbered with the argument in @pkey. This bug can be triggered when the kernel fails to access user-given memory pages that are protected by a pkey, so it can go down the do_user_addr_fault() path and pass the !user_mode() check in __bad_area_nosemaphore(). Most cases will simply run the kernel fixup code to make an -EFAULT. But when another condition current->thread.sig_on_uaccess_err is met, which is only used to emulate vsyscall, the kernel will generate the wrong signal. Add a new parameter @pkey to kernelmode_fixup_or_oops() to fix this. [ bp: Massage commit message, fix build error as reported by the 0day bot: https://lkml.kernel.org/r/[email protected] ] Fixes: 5042d40a264c ("x86/fault: Bypass no_context() for implicit kernel faults from usermode") Reported-by: kernel test robot <[email protected]> Signed-off-by: Jiashuo Liang <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Acked-by: Dave Hansen <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-09-20Merge tag 'spi-fix-v5.15-rc2' of ↵Linus Torvalds2-5/+10
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi Pull spi fixes from Mark BrownL "This contains a couple of fixes, one fix for handling of zero length transfers on Rockchip devices and a warning fix which will conflict with a version you did but cleans up some extra unneeded forward declarations as well which seems a bit neater" * tag 'spi-fix-v5.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi: spi: tegra20-slink: Declare runtime suspend and resume functions conditionally spi: rockchip: handle zero length transfers without timing out
2021-09-20Merge tag 'regulator-fix-v5.15-rc2' of ↵Linus Torvalds2-3/+1
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator Pull regulator fixes from Mark Brown: "A couple of small device specific fixes that have been sent since the merge window, neither of which stands out particularly" * tag 'regulator-fix-v5.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: regulator: max14577: Revert "regulator: max14577: Add proper module aliases strings" regulator: qcom-rpmh-regulator: fix pm8009-1 ldo7 resource name
2021-09-20drm/nouveau/nvkm: Replace -ENOSYS with -ENODEVGuenter Roeck1-1/+1
nvkm test builds fail with the following error. drivers/gpu/drm/nouveau/nvkm/engine/device/ctrl.c: In function 'nvkm_control_mthd_pstate_info': drivers/gpu/drm/nouveau/nvkm/engine/device/ctrl.c:60:35: error: overflow in conversion from 'int' to '__s8' {aka 'signed char'} changes value from '-251' to '5' The code builds on most architectures, but fails on parisc where ENOSYS is defined as 251. Replace the error code with -ENODEV (-19). The actual error code does not really matter and is not passed to userspace - it just has to be negative. Fixes: 7238eca4cf18 ("drm/nouveau: expose pstate selection per-power source in sysfs") Signed-off-by: Guenter Roeck <[email protected]> Cc: Ben Skeggs <[email protected]> Cc: David Airlie <[email protected]> Cc: Daniel Vetter <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-20sparc64: fix pci_iounmap() when CONFIG_PCI is not setLinus Torvalds1-0/+2
Guenter reported [1] that the pci_iounmap() changes remain problematic, with sparc64 allnoconfig and tinyconfig still not building due to the header file changes and confusion with the arch-specific pci_iounmap() implementation. I'm pretty convinced that sparc should just use GENERIC_IOMAP instead of doing its own thing, since it turns out that the sparc64 version of pci_iounmap() is somewhat buggy (see [2]). But in the meantime, this just fixes the build by avoiding the trivial re-definition of the empty case. Link: https://lore.kernel.org/lkml/[email protected]/ [1] Link: https://lore.kernel.org/lkml/CAHk-=wgheheFx9myQyy5osh79BAazvmvYURAtub2gQtMvLrhqQ@mail.gmail.com/ [2] Reported-by: Guenter Roeck <[email protected]> Cc: David Miller <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-20RDMA/irdma: Report correct WC error when there are MW bind errorsSindhu Devale3-0/+8
Report the correct WC error when MW bind error related asynchronous events are generated by HW. Fixes: b48c24c2d710 ("RDMA/irdma: Implement device supported verb APIs") Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sindhu Devale <[email protected]> Signed-off-by: Shiraz Saleem <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2021-09-20RDMA/irdma: Report correct WC error when transport retry counter is exceededSindhu Devale3-0/+6
When the retry counter exceeds, as the remote QP didn't send any Ack or Nack an asynchronous event (AE) for too many retries is generated. Add code to handle the AE and set the correct IB WC error code IB_WC_RETRY_EXC_ERR. Fixes: b48c24c2d710 ("RDMA/irdma: Implement device supported verb APIs") Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sindhu Devale <[email protected]> Signed-off-by: Shiraz Saleem <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2021-09-20RDMA/irdma: Validate number of CQ entries on create CQSindhu Devale1-1/+1
Add lower bound check for CQ entries at creation time. Fixes: b48c24c2d710 ("RDMA/irdma: Implement device supported verb APIs") Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sindhu Devale <[email protected]> Signed-off-by: Shiraz Saleem <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2021-09-20RDMA/irdma: Skip CQP ring during a resetSindhu Devale6-10/+8
Due to duplicate reset flags, CQP commands are processed during reset. This leads CQP failures such as below: irdma0: [Delete Local MAC Entry Cmd Error][op_code=49] status=-27 waiting=1 completion_err=0 maj=0x0 min=0x0 Remove the redundant flag and set the correct reset flag so CPQ is paused during reset Fixes: 8498a30e1b94 ("RDMA/irdma: Register auxiliary driver and implement private channel OPs") Link: https://lore.kernel.org/r/[email protected] Reported-by: LiLiang <[email protected]> Signed-off-by: Sindhu Devale <[email protected]> Signed-off-by: Shiraz Saleem <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2021-09-20MAINTAINERS: Update Broadcom RDMA maintainersSelvin Xavier1-1/+0
Updating the bnxt_re maintainers as Naresh decided to leave Broadcom. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Selvin Xavier <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2021-09-20swiotlb-xen: this is PV-only on x86Jan Beulich3-13/+1
The code is unreachable for HVM or PVH, and it also makes little sense in auto-translated environments. On Arm, with xen_{create,destroy}_contiguous_region() both being stubs, I have a hard time seeing what good the Xen specific variant does - the generic one ought to be fine for all purposes there. Still Arm code explicitly references symbols here, so the code will continue to be included there. Instead of making PCI_XEN's "select" conditional, simply drop it - SWIOTLB_XEN will be available unconditionally in the PV case anyway, and is - as explained above - dead code in non-PV environments. This in turn allows dropping the stubs for xen_{create,destroy}_contiguous_region(), the former of which was broken anyway - it failed to set the DMA handle output. Signed-off-by: Jan Beulich <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Stefano Stabellini <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Juergen Gross <[email protected]>
2021-09-20xen/pci-swiotlb: reduce visibility of symbolsJan Beulich2-7/+3
xen_swiotlb and pci_xen_swiotlb_init() are only used within the file defining them, so make them static and remove the stubs. Otoh pci_xen_swiotlb_detect() has a use (as function pointer) from the main pci-swiotlb.c file - convert its stub to a #define to NULL. Signed-off-by: Jan Beulich <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Juergen Gross <[email protected]>
2021-09-20PCI: only build xen-pcifront in PV-enabled environmentsJan Beulich1-1/+1
The driver's module init function, pcifront_init(), invokes xen_pv_domain() first thing. That construct produces constant "false" when !CONFIG_XEN_PV. Hence there's no point building the driver in non-PV configurations. Drop the (now implicit and generally wrong) X86 dependency: At present, XEN_PV can only be set when X86 is also enabled. In general an architecture supporting Xen PV (and PCI) would want to have this driver built. Signed-off-by: Jan Beulich <[email protected]> Reviewed-by: Stefano Stabellini <[email protected]> Acked-by: Bjorn Helgaas <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Juergen Gross <[email protected]>
2021-09-20swiotlb-xen: ensure to issue well-formed XENMEM_exchange requestsJan Beulich1-3/+4
While the hypervisor hasn't been enforcing this, we would still better avoid issuing requests with GFNs not aligned to the requested order. Instead of altering the value also in the call to panic(), drop it there for being static and hence easy to determine without being part of the panic message. Signed-off-by: Jan Beulich <[email protected]> Reviewed-by: Stefano Stabellini <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Juergen Gross <[email protected]>
2021-09-20Xen/gntdev: don't ignore kernel unmapping errorJan Beulich1-0/+8
While working on XSA-361 and its follow-ups, I failed to spot another place where the kernel mapping part of an operation was not treated the same as the user space part. Detect and propagate errors and add a 2nd pr_debug(). Signed-off-by: Jan Beulich <[email protected]> Reviewed-by: Juergen Gross <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Juergen Gross <[email protected]>
2021-09-20xen/x86: drop redundant zeroing from cpu_initialize_context()Jan Beulich1-4/+0
Just after having obtained the pointer from kzalloc() there's no reason at all to set part of the area to all zero yet another time. Similarly there's no point explicitly clearing "ldt_ents". Signed-off-by: Jan Beulich <[email protected]> Reviewed-by: Boris Ostrovsky <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Juergen Gross <[email protected]>
2021-09-20Merge branch 'hns3-fixes'David S. Miller5-49/+103
Guangbin Huang says: ==================== net: hns3: add some fixes for -net This series adds some fixes for the HNS3 ethernet driver. ==================== Signed-off-by: David S. Miller <[email protected]>
2021-09-20net: hns3: fix a return value error in hclge_get_reset_status()Yufeng Mo1-4/+11
hclge_get_reset_status() should return the tqp reset status. However, if the CMDQ fails, the caller will take it as tqp reset success status by mistake. Therefore, uses a parameters to get the tqp reset status instead. Fixes: 46a3df9f9718 ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer Support") Signed-off-by: Yufeng Mo <[email protected]> Signed-off-by: Guangbin Huang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-09-20net: hns3: check vlan id before using itliaoguojia1-0/+3
The input parameters may not be reliable, so check the vlan id before using it, otherwise may set wrong vlan id into hardware. Fixes: dc8131d846d4 ("net: hns3: Fix for packet loss due wrong filter config in VLAN tbls") Signed-off-by: liaoguojia <[email protected]> Signed-off-by: Guangbin Huang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-09-20net: hns3: check queue id range before usingYufeng Mo1-0/+8
The input parameters may not be reliable. Before using the queue id, we should check this parameter. Otherwise, memory overwriting may occur. Fixes: d34100184685 ("net: hns3: refactor the mailbox message between PF and VF") Signed-off-by: Yufeng Mo <[email protected]> Signed-off-by: Guangbin Huang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-09-20net: hns3: fix misuse vf id and vport id in some logsJiaran Zhang4-10/+12
vport_id include PF and VFs, vport_id = 0 means PF, other values mean VFs. So the actual vf id is equal to vport_id minus 1. Some VF print logs are actually vport, and logs of vf id actually use vport id, so this patch fixes them. Fixes: ac887be5b0fe ("net: hns3: change print level of RAS error log from warning to error") Fixes: adcf738b804b ("net: hns3: cleanup some print format warning") Signed-off-by: Jiaran Zhang <[email protected]> Signed-off-by: Guangbin Huang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-09-20net: hns3: fix inconsistent vf id printJian Shen1-2/+5
The vf id from ethtool is added 1 before configured to driver. So it's necessary to minus 1 when printing it, in order to keep consistent with user's configuration. Fixes: dd74f815dd41 ("net: hns3: Add support for rule add/delete for flow director") Signed-off-by: Jian Shen <[email protected]> Signed-off-by: Guangbin Huang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-09-20net: hns3: fix change RSS 'hfunc' ineffective issueJian Shen2-33/+64
When user change rss 'hfunc' without set rss 'hkey' by ethtool -X command, the driver will ignore the 'hfunc' for the hkey is NULL. It's unreasonable. So fix it. Fixes: 46a3df9f9718 ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer Support") Fixes: 374ad291762a ("net: hns3: Add RSS general configuration support for VF") Signed-off-by: Jian Shen <[email protected]> Signed-off-by: Guangbin Huang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-09-20KVM: arm64: Fix PMU probe orderingMarc Zyngier5-7/+16
Russell reported that since 5.13, KVM's probing of the PMU has started to fail on his HW. As it turns out, there is an implicit ordering dependency between the architectural PMU probing code and and KVM's own probing. If, due to probe ordering reasons, KVM probes before the PMU driver, it will fail to detect the PMU and prevent it from being advertised to guests as well as the VMM. Obviously, this is one probing too many, and we should be able to deal with any ordering. Add a callback from the PMU code into KVM to advertise the registration of a host CPU PMU, allowing for any probing order. Fixes: 5421db1be3b1 ("KVM: arm64: Divorce the perf code from oprofile helpers") Reported-by: "Russell King (Oracle)" <[email protected]> Tested-by: Russell King (Oracle) <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected] Cc: [email protected]
2021-09-20KVM: arm64: nvhe: Fix missing FORCE for hyp-reloc.S build ruleZenghui Yu1-1/+1
Add FORCE so that if_changed can detect the command line change. We'll otherwise see a compilation warning since commit e1f86d7b4b2a ("kbuild: warn if FORCE is missing for if_changed(_dep,_rule) and filechk"). arch/arm64/kvm/hyp/nvhe/Makefile:58: FORCE prerequisite is missing Cc: David Brazdil <[email protected]> Cc: Masahiro Yamada <[email protected]> Signed-off-by: Zenghui Yu <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-09-20staging: r8188eu: fix -Wrestrict warningsArnd Bergmann1-4/+4
Adding back the nonstandard ioctl commands caused -Wrestrict warnings when building with 'make W=1': drivers/staging/r8188eu/os_dep/ioctl_linux.c: In function 'rtw_mp_read_rf': drivers/staging/r8188eu/os_dep/ioctl_linux.c:5515:27: error: 'sprintf' argument 3 overlaps destination object 'extra' [-Werror=restrict] 5515 | sprintf(extra, "%s %d", extra, strtou); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ drivers/staging/r8188eu/os_dep/ioctl_linux.c:5470:54: note: destination object referenced by 'restrict'-qualified argument 1 was declared here 5470 | struct iw_point *wrqu, char *extra) | ~~~~~~^~~~~ Change these to the same construct used elsewhere in that driver, with an offset to the string to make the warning go away. The ioctl commands were previously removed, and it's unlikely that anything is actually using them, so ideally I would prefer to have them removed again. The lack of range checking of the 'extra' output buffer is also slightly worrying, but I did not check whether this could cause harm. Fixes: 2b42bd58b321 ("staging: r8188eu: introduce new os_dep dir for RTL8188eu driver") Signed-off-by: Arnd Bergmann <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2021-09-20ptp: ocp: add COMMON_CLK dependencyArnd Bergmann1-0/+1
Without CONFIG_COMMON_CLK, this fails to link: arm-linux-gnueabi-ld: drivers/ptp/ptp_ocp.o: in function `ptp_ocp_register_i2c': ptp_ocp.c:(.text+0xcc0): undefined reference to `__clk_hw_register_fixed_rate' arm-linux-gnueabi-ld: ptp_ocp.c:(.text+0xcf4): undefined reference to `devm_clk_hw_register_clkdev' arm-linux-gnueabi-ld: drivers/ptp/ptp_ocp.o: in function `ptp_ocp_detach': ptp_ocp.c:(.text+0x1c24): undefined reference to `clk_hw_unregister_fixed_rate' Fixes: a7e1abad13f3 ("ptp: Add clock driver for the OpenCompute TimeCard.") Signed-off-by: Arnd Bergmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-09-20USB: serial: option: remove duplicate USB device IDKrzysztof Kozlowski1-1/+0
The device ZTE 0x0094 is already on the list. Signed-off-by: Krzysztof Kozlowski <[email protected]> Fixes: b9e44fe5ecda ("USB: option: cleanup zte 3g-dongle's pid in option.c") Cc: [email protected] Signed-off-by: Johan Hovold <[email protected]>
2021-09-20USB: serial: mos7840: remove duplicated 0xac24 device IDKrzysztof Kozlowski1-2/+0
0xac24 device ID is already defined and used via BANDB_DEVICE_ID_USO9ML2_4. Remove the duplicate from the list. Fixes: 27f1281d5f72 ("USB: serial: Extra device/vendor ID for mos7840 driver") Signed-off-by: Krzysztof Kozlowski <[email protected]> Cc: [email protected] Signed-off-by: Johan Hovold <[email protected]>
2021-09-20bnxt_en: Fix TX timeout when TX ring size is set to the smallestMichael Chan3-5/+10
The smallest TX ring size we support must fit a TX SKB with MAX_SKB_FRAGS + 1. Because the first TX BD for a packet is always a long TX BD, we need an extra TX BD to fit this packet. Define BNXT_MIN_TX_DESC_CNT with this value to make this more clear. The current code uses a minimum that is off by 1. Fix it using this constant. The tx_wake_thresh to determine when to wake up the TX queue is half the ring size but we must have at least BNXT_MIN_TX_DESC_CNT for the next packet which may have maximum fragments. So the comparison of the available TX BDs with tx_wake_thresh should be >= instead of > in the current code. Otherwise, at the smallest ring size, we will never wake up the TX queue and will cause TX timeout. Fixes: c0c050c58d84 ("bnxt_en: New Broadcom ethernet driver.") Reviewed-by: Pavan Chebbi <[email protected]> Signed-off-by: Michael Chan <[email protected]> Signed-off-by: David S. Miller <[email protected]>