aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2023-04-17Merge tag 'mlx5-updates-2023-04-14' of ↵David S. Miller15-89/+1025
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux mlx5-updates-2023-04-14 Yevgeny Kliteynik Says: ======================= SW Steering: Support pattern/args modify_header actions The following patch series adds support for a new pattern/arguments type of modify_header actions. Starting with ConnectX-6 DX, we use a new design of modify_header FW object. The current modify_header object allows for having only limited number of these FW objects, which means that we are limited in the number of offloaded flows that require modify_header action. The new approach comprises of two types of objects: pattern and argument. Pattern holds header modification templates, later used with corresponding argument object to create complete header modification actions. The pattern indicates which headers are modified, while the arguments provide the specific values. Therefore a single pattern can be used with different arguments in different flows, enabling offloading of large number of modify_header flows. - Patch 1, 2: Add ICM pool for modify-header-pattern objects and implement patterns cache, allowing patterns reuse for different flows - Patch 3: Allow for chunk allocation separately for STEv0 and STEv1 - Patch 4: Read related device capabilities - Patch 5: Add create/destroy functions for the new general object type - Patch 6: Add support for writing modify header argument to ICM - Patch 7, 8: Some required fixes to support pattern/arg - separate read buffer from the write buffer and fix QP continuous allocation - Patch 9: Add pool for modify header arg objects - Patch 10, 11, 12: Implement MODIFY_HEADER and TNL_L3_TO_L2 actions with the new patterns/args design - Patch 13: Optimization - set modify header action of size 1 directly on the STE instead of separate pattern/args combination - Patch 14: Adjust debug dump for patterns/args - Patch 15: Enable patterns and arguments for supporting devices =======================
2023-04-17Merge branch 'ovs-selftests'David S. Miller2-16/+1349
Aaron Conole says: ==================== selftests: openvswitch: add support for testing upcall interface The existing selftest suite for openvswitch will work for regression testing the datapath feature bits, but won't test things like adding interfaces, or the upcall interface. Here, we add some additional test facilities. First, extend the ovs-dpctl.py python module to support the OVS_FLOW and OVS_PACKET netlink families, with some associated messages. These can be extended over time, but the initial support is for more well known cases (output, userspace, and CT). Next, extend the test suite to test upcalls by adding a datapath, monitoring the upcall socket associated with the datapath, and then dumping any upcalls that are received. Compare with expected ARP upcall via arping. ==================== Signed-off-by: David S. Miller <[email protected]>
2023-04-17selftests: openvswitch: add support for upcall testingAaron Conole2-11/+165
The upcall socket interface can be exercised now to make sure that future feature adjustments to the field can maintain backwards compatibility. Signed-off-by: Aaron Conole <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-04-17selftests: openvswitch: add flow dump supportAaron Conole1-0/+1026
Add a basic set of fields to print in a 'dpflow' format. This will be used by future commits to check for flow fields after parsing, as well as verifying the flow fields pushed into the kernel from userspace. Signed-off-by: Aaron Conole <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-04-17selftests: openvswitch: add interface supportAaron Conole2-10/+163
Includes an associated test to generate netns and connect interfaces, with the option to include packet tracing. This will be used in the future when flow support is added for additional test cases. Signed-off-by: Aaron Conole <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-04-17net: phy: micrel: Fix PTP_PF_PEROUT for lan8841Horatiu Vultur1-1/+1
If the 1PPS output was enabled and then lan8841 was configured to be a follower, then target clock which is used to generate the 1PPS was not configure correctly. The problem was that for each adjustments of the time, also the nanosecond part of the target clock was changed. Therefore the initial nanosecond part of the target clock was changed. The issue can be observed if both the leader and the follower are generating 1PPS and see that their PPS are not aligned even if the time is allined. The fix consists of not modifying the nanosecond part of the target clock when adjusting the time. In this way the 1PPS get also aligned. Fixes: e4ed8ba08e3f ("net: phy: micrel: Add support for PTP_PF_PEROUT for lan8841") Signed-off-by: Horatiu Vultur <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-04-17virtio_net: bugfix overflow inside xdp_linearize_page()Xuan Zhuo1-2/+6
Here we copy the data from the original buf to the new page. But we not check that it may be overflow. As long as the size received(including vnethdr) is greater than 3840 (PAGE_SIZE -VIRTIO_XDP_HEADROOM). Then the memcpy will overflow. And this is completely possible, as long as the MTU is large, such as 4096. In our test environment, this will cause crash. Since crash is caused by the written memory, it is meaningless, so I do not include it. Fixes: 72979a6c3590 ("virtio_net: xdp, add slowpath case for non contiguous buffers") Signed-off-by: Xuan Zhuo <[email protected]> Acked-by: Jason Wang <[email protected]> Acked-by: Michael S. Tsirkin <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-04-16Linux 6.3-rc7Linus Torvalds1-1/+1
2023-04-16Revert "userfaultfd: don't fail on unrecognized features"Peter Xu1-2/+4
This is a proposal to revert commit 914eedcb9ba0ff53c33808. I found this when writing a simple UFFDIO_API test to be the first unit test in this set. Two things breaks with the commit: - UFFDIO_API check was lost and missing. According to man page, the kernel should reject ioctl(UFFDIO_API) if uffdio_api.api != 0xaa. This check is needed if the api version will be extended in the future, or user app won't be able to identify which is a new kernel. - Feature flags checks were removed, which means UFFDIO_API with a feature that does not exist will also succeed. According to the man page, we should (and it makes sense) to reject ioctl(UFFDIO_API) if unknown features passed in. Link: https://lore.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 914eedcb9ba0 ("userfaultfd: don't fail on unrecognized features") Signed-off-by: Peter Xu <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Dmitry Safonov <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: Zach O'Keefe <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-04-16writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbsBaokun Li2-9/+20
KASAN report null-ptr-deref: ================================================================== BUG: KASAN: null-ptr-deref in bdi_split_work_to_wbs+0x5c5/0x7b0 Write of size 8 at addr 0000000000000000 by task sync/943 CPU: 5 PID: 943 Comm: sync Tainted: 6.3.0-rc5-next-20230406-dirty #461 Call Trace: <TASK> dump_stack_lvl+0x7f/0xc0 print_report+0x2ba/0x340 kasan_report+0xc4/0x120 kasan_check_range+0x1b7/0x2e0 __kasan_check_write+0x24/0x40 bdi_split_work_to_wbs+0x5c5/0x7b0 sync_inodes_sb+0x195/0x630 sync_inodes_one_sb+0x3a/0x50 iterate_supers+0x106/0x1b0 ksys_sync+0x98/0x160 [...] ================================================================== The race that causes the above issue is as follows: cpu1 cpu2 -------------------------|------------------------- inode_switch_wbs INIT_WORK(&isw->work, inode_switch_wbs_work_fn) queue_rcu_work(isw_wq, &isw->work) // queue_work async inode_switch_wbs_work_fn wb_put_many(old_wb, nr_switched) percpu_ref_put_many ref->data->release(ref) cgwb_release queue_work(cgwb_release_wq, &wb->release_work) // queue_work async &wb->release_work cgwb_release_workfn ksys_sync iterate_supers sync_inodes_one_sb sync_inodes_sb bdi_split_work_to_wbs kmalloc(sizeof(*work), GFP_ATOMIC) // alloc memory failed percpu_ref_exit ref->data = NULL kfree(data) wb_get(wb) percpu_ref_get(&wb->refcnt) percpu_ref_get_many(ref, 1) atomic_long_add(nr, &ref->data->count) atomic64_add(i, v) // trigger null-ptr-deref bdi_split_work_to_wbs() traverses &bdi->wb_list to split work into all wbs. If the allocation of new work fails, the on-stack fallback will be used and the reference count of the current wb is increased afterwards. If cgroup writeback membership switches occur before getting the reference count and the current wb is released as old_wd, then calling wb_get() or wb_put() will trigger the null pointer dereference above. This issue was introduced in v4.3-rc7 (see fix tag1). Both sync_inodes_sb() and __writeback_inodes_sb_nr() calls to bdi_split_work_to_wbs() can trigger this issue. For scenarios called via sync_inodes_sb(), originally commit 7fc5854f8c6e ("writeback: synchronize sync(2) against cgroup writeback membership switches") reduced the possibility of the issue by adding wb_switch_rwsem, but in v5.14-rc1 (see fix tag2) removed the "inode_io_list_del_locked(inode, old_wb)" from inode_switch_wbs_work_fn() so that wb->state contains WB_has_dirty_io, thus old_wb is not skipped when traversing wbs in bdi_split_work_to_wbs(), and the issue becomes easily reproducible again. To solve this problem, percpu_ref_exit() is called under RCU protection to avoid race between cgwb_release_workfn() and bdi_split_work_to_wbs(). Moreover, replace wb_get() with wb_tryget() in bdi_split_work_to_wbs(), and skip the current wb if wb_tryget() fails because the wb has already been shutdown. Link: https://lkml.kernel.org/r/[email protected] Fixes: b817525a4a80 ("writeback: bdi_writeback iteration must not skip dying ones") Signed-off-by: Baokun Li <[email protected]> Reviewed-by: Jan Kara <[email protected]> Acked-by: Tejun Heo <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andreas Dilger <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Hou Tao <[email protected]> Cc: yangerkun <[email protected]> Cc: Zhang Yi <[email protected]> Cc: Jens Axboe <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-04-16maple_tree: fix a potential memory leak, OOB access, or other unpredictable bugPeng Zhang1-12/+7
In mas_alloc_nodes(), "node->node_count = 0" means to initialize the node_count field of the new node, but the node may not be a new node. It may be a node that existed before and node_count has a value, setting it to 0 will cause a memory leak. At this time, mas->alloc->total will be greater than the actual number of nodes in the linked list, which may cause many other errors. For example, out-of-bounds access in mas_pop_node(), and mas_pop_node() may return addresses that should not be used. Fix it by initializing node_count only for new nodes. Also, by the way, an if-else statement was removed to simplify the code. Link: https://lkml.kernel.org/r/[email protected] Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Peng Zhang <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-04-16tools/mm/page_owner_sort.c: fix TGID output when cull=tg is usedSteve Chou1-1/+1
When using cull option with 'tg' flag, the fprintf is using pid instead of tgid. It should use tgid instead. Link: https://lkml.kernel.org/r/[email protected] Fixes: 9c8a0a8e599f4a ("tools/vm/page_owner_sort.c: support for user-defined culling rules") Signed-off-by: Steve Chou <[email protected]> Cc: Jiajian Ye <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-04-16mailmap: update jtoppins' entry to reference correct emailJonathan Toppins1-0/+2
Link: https://lkml.kernel.org/r/d79bc6eaf65e68bd1c2a1e1510ab6291ce5926a6.1681162487.git.jtoppins@redhat.com Signed-off-by: Jonathan Toppins <[email protected]> Cc: Colin Ian King <[email protected]> Cc: Jakub Kicinski <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Konrad Dybcio <[email protected]> Cc: Qais Yousef <[email protected]> Cc: Stephen Hemminger <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-04-16mm/mempolicy: fix use-after-free of VMA iteratorLiam R. Howlett1-55/+49
set_mempolicy_home_node() iterates over a list of VMAs and calls mbind_range() on each VMA, which also iterates over the singular list of the VMA passed in and potentially splits the VMA. Since the VMA iterator is not passed through, set_mempolicy_home_node() may now point to a stale node in the VMA tree. This can result in a UAF as reported by syzbot. Avoid the stale maple tree node by passing the VMA iterator through to the underlying call to split_vma(). mbind_range() is also overly complicated, since there are two calling functions and one already handles iterating over the VMAs. Simplify mbind_range() to only handle merging and splitting of the VMAs. Align the new loop in do_mbind() and existing loop in set_mempolicy_home_node() to use the reduced mbind_range() function. This allows for a single location of the range calculation and avoids constantly looking up the previous VMA (since this is a loop over the VMAs). Link: https://lore.kernel.org/linux-mm/[email protected]/ Fixes: 66850be55e8e ("mm/mempolicy: use vma iterator & maple state instead of vma linked list") Signed-off-by: Liam R. Howlett <[email protected]> Reported-by: [email protected] Link: https://lkml.kernel.org/r/[email protected] Tested-by: [email protected] Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-04-16mm/huge_memory.c: warn with pr_warn_ratelimited instead of VM_WARN_ON_ONCE_FOLIONaoya Horiguchi1-2/+3
split_huge_page_to_list() WARNs when called for huge zero pages, which sounds to me too harsh because it does not imply a kernel bug, but just notifies the event to admins. On the other hand, this is considered as critical by syzkaller and makes its testing less efficient, which seems to me harmful. So replace the VM_WARN_ON_ONCE_FOLIO with pr_warn_ratelimited. Link: https://lkml.kernel.org/r/[email protected] Fixes: 478d134e9506 ("mm/huge_memory: do not overkill when splitting huge_zero_page") Signed-off-by: Naoya Horiguchi <[email protected]> Reported-by: [email protected] Link: https://lore.kernel.org/lkml/[email protected]/ Reviewed-by: Yang Shi <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: Xu Yu <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-04-16mm/mprotect: fix do_mprotect_pkey() return on errorLiam R. Howlett1-1/+1
When the loop over the VMA is terminated early due to an error, the return code could be overwritten with ENOMEM. Fix the return code by only setting the error on early loop termination when the error is not set. User-visible effects include: attempts to run mprotect() against a special mapping or with a poorly-aligned hugetlb address should return -EINVAL, but they presently return -ENOMEM. In other cases an -EACCESS should be returned. Link: https://lkml.kernel.org/r/[email protected] Fixes: 2286a6914c77 ("mm: change mprotect_fixup to vma iterator") Signed-off-by: Liam R. Howlett <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-04-16mm/khugepaged: check again on anon uffd-wp during isolationPeter Xu1-0/+4
Khugepaged collapse an anonymous thp in two rounds of scans. The 2nd round done in __collapse_huge_page_isolate() after hpage_collapse_scan_pmd(), during which all the locks will be released temporarily. It means the pgtable can change during this phase before 2nd round starts. It's logically possible some ptes got wr-protected during this phase, and we can errornously collapse a thp without noticing some ptes are wr-protected by userfault. e1e267c7928f wanted to avoid it but it only did that for the 1st phase, not the 2nd phase. Since __collapse_huge_page_isolate() happens after a round of small page swapins, we don't need to worry on any !present ptes - if it existed khugepaged will already bail out. So we only need to check present ptes with uffd-wp bit set there. This is something I found only but never had a reproducer, I thought it was one caused a bug in Muhammad's recent pagemap new ioctl work, but it turns out it's not the cause of that but an userspace bug. However this seems to still be a real bug even with a very small race window, still worth to have it fixed and copy stable. Link: https://lkml.kernel.org/r/[email protected] Fixes: e1e267c7928f ("khugepaged: skip collapse if uffd-wp detected") Signed-off-by: Peter Xu <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Yang Shi <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Nadav Amit <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-04-16mm/userfaultfd: fix uffd-wp handling for THP migration entriesDavid Hildenbrand1-2/+12
Looks like what we fixed for hugetlb in commit 44f86392bdd1 ("mm/hugetlb: fix uffd-wp handling for migration entries in hugetlb_change_protection()") similarly applies to THP. Setting/clearing uffd-wp on THP migration entries is not implemented properly. Further, while removing migration PMDs considers the uffd-wp bit, inserting migration PMDs does not consider the uffd-wp bit. We have to set/clear independently of the migration entry type in change_huge_pmd() and properly copy the uffd-wp bit in set_pmd_migration_entry(). Verified using a simple reproducer that triggers migration of a THP, that the set_pmd_migration_entry() no longer loses the uffd-wp bit. Link: https://lkml.kernel.org/r/[email protected] Fixes: f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration") Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Peter Xu <[email protected]> Cc: <[email protected]> Cc: Muhammad Usama Anjum <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-04-16mm: swap: fix performance regression on sparsetruncate-tinyQi Zheng1-1/+1
The ->percpu_pvec_drained was originally introduced by commit d9ed0d08b6c6 ("mm: only drain per-cpu pagevecs once per pagevec usage") to drain per-cpu pagevecs only once per pagevec usage. But after converting the swap code to be more folio-based, the commit c2bc16817aa0 ("mm/swap: add folio_batch_move_lru()") breaks this logic, which would cause ->percpu_pvec_drained to be reset to false, that means per-cpu pagevecs will be drained multiple times per pagevec usage. In theory, there should be no functional changes when converting code to be more folio-based. We should call folio_batch_reinit() in folio_batch_move_lru() instead of folio_batch_init(). And to verify that we still need ->percpu_pvec_drained, I ran mmtests/sparsetruncate-tiny and got the following data: baseline with baseline/ patch/ Min Time 326.00 ( 0.00%) 328.00 ( -0.61%) 1st-qrtle Time 334.00 ( 0.00%) 336.00 ( -0.60%) 2nd-qrtle Time 338.00 ( 0.00%) 341.00 ( -0.89%) 3rd-qrtle Time 343.00 ( 0.00%) 347.00 ( -1.17%) Max-1 Time 326.00 ( 0.00%) 328.00 ( -0.61%) Max-5 Time 327.00 ( 0.00%) 330.00 ( -0.92%) Max-10 Time 328.00 ( 0.00%) 331.00 ( -0.91%) Max-90 Time 350.00 ( 0.00%) 357.00 ( -2.00%) Max-95 Time 395.00 ( 0.00%) 390.00 ( 1.27%) Max-99 Time 508.00 ( 0.00%) 434.00 ( 14.57%) Max Time 547.00 ( 0.00%) 476.00 ( 12.98%) Amean Time 344.61 ( 0.00%) 345.56 * -0.28%* Stddev Time 30.34 ( 0.00%) 19.51 ( 35.69%) CoeffVar Time 8.81 ( 0.00%) 5.65 ( 35.87%) BAmean-99 Time 342.38 ( 0.00%) 344.27 ( -0.55%) BAmean-95 Time 338.58 ( 0.00%) 341.87 ( -0.97%) BAmean-90 Time 336.89 ( 0.00%) 340.26 ( -1.00%) BAmean-75 Time 335.18 ( 0.00%) 338.40 ( -0.96%) BAmean-50 Time 332.54 ( 0.00%) 335.42 ( -0.87%) BAmean-25 Time 329.30 ( 0.00%) 332.00 ( -0.82%) From the above it can be seen that we get similar data to when ->percpu_pvec_drained was introduced, so we still need it. Let's call folio_batch_reinit() in folio_batch_move_lru() to restore the original logic. Link: https://lkml.kernel.org/r/[email protected] Fixes: c2bc16817aa0 ("mm/swap: add folio_batch_move_lru()") Signed-off-by: Qi Zheng <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-04-16Merge tag 'sched_urgent_for_v6.3_rc7' of ↵Linus Torvalds1-0/+10
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Borislav Petkov: - Do not pull tasks to the local scheduling group if its average load is higher than the average system load * tag 'sched_urgent_for_v6.3_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix imbalance overflow
2023-04-16Merge tag 'x86_urgent_for_v6.3_rc7' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fix from Borislav Petkov: - Drop __init annotation from two rtc functions which get called after boot is done, in order to prevent a crash * tag 'x86_urgent_for_v6.3_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/rtc: Remove __init for runtime functions
2023-04-16Merge tag 'powerpc-6.3-5' of ↵Linus Torvalds2-0/+8
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fix from Michael Ellerman: - A fix for NUMA distance handling in the pseries SCM (pmem) driver. Thanks to Aneesh Kumar K.V. * tag 'powerpc-6.3-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/papr_scm: Update the NUMA distance table for the target node
2023-04-16Merge tag 'kbuild-fixes-v6.3-3' of ↵Linus Torvalds9-139/+138
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild fixes from Masahiro Yamada: - Drop debug info from purgatory objects again - Document that kernel.org provides prebuilt LLVM toolchains - Give up handling untracked files for source package builds - Avoid creating corrupted cpio when KBUILD_BUILD_TIMESTAMP is given with a pre-epoch data. - Change panic_show_mem() to a macro to handle variable-length argument - Compress tarballs on-the-fly again * tag 'kbuild-fixes-v6.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: kbuild: do not create intermediate *.tar for tar packages kbuild: do not create intermediate *.tar for source tarballs kbuild: merge cmd_archive_linux and cmd_archive_perf init/initramfs: Fix argument forwarding to panic() in panic_show_mem() initramfs: Check negative timestamp to prevent broken cpio archive kbuild: give up untracked files for source package builds Documentation/llvm: Add a note about prebuilt kernel.org toolchains purgatory: fix disabling debug info
2023-04-16Merge tag '6.3-rc6-ksmbd-server-fix' of git://git.samba.org/ksmbdLinus Torvalds1-9/+14
Pull ksmbd server fix from Steve French: "smb311 server preauth integrity negotiate context parsing fix (check for out of bounds access)" * tag '6.3-rc6-ksmbd-server-fix' of git://git.samba.org/ksmbd: ksmbd: avoid out of bounds access in decode_preauth_ctxt()
2023-04-16kbuild: do not create intermediate *.tar for tar packagesMasahiro Yamada1-17/+9
Commit 05e96e96a315 ("kbuild: use git-archive for source package creation") split the compression as a separate step to factor out the common build rules. With the previous commit, we got back to the situation where source tarballs are compressed on-the-fly. There is no reason to keep the separate compression rules. Generate the comressed tar packages directly. Signed-off-by: Masahiro Yamada <[email protected]> Reviewed-by: Nathan Chancellor <[email protected]>
2023-04-16kbuild: do not create intermediate *.tar for source tarballsMasahiro Yamada1-7/+17
Since commit 05e96e96a315 ("kbuild: use git-archive for source package creation"), a source tarball is created in two steps; create *.tar file then compress it. I split the compression as a separate rule because I just thought 'git archive' supported only gzip. For other compression algorithms, I could pipe the two commands: $ git archive HEAD | xz > linux.tar.xz I read git-archive(1) carefully, and I realized GIT had provided a more elegant way: $ git -c tar.tar.xz.command=xz archive -o linux.tar.xz HEAD This commit uses 'tar.tar.*.command' configuration to specify the compression backend so we can compress a source tarball on-the-fly. GIT commit 767cf4579f0e ("archive: implement configurable tar filters") is more than a decade old, so it should be available on almost all build environments. Signed-off-by: Masahiro Yamada <[email protected]> Reviewed-by: Nathan Chancellor <[email protected]>
2023-04-16kbuild: merge cmd_archive_linux and cmd_archive_perfMasahiro Yamada1-10/+9
The two commands, cmd_archive_linux and cmd_archive_perf, are similar. Merge them to make it easier to add more changes to the git-archive command. Signed-off-by: Masahiro Yamada <[email protected]> Reviewed-by: Nathan Chancellor <[email protected]>
2023-04-16init/initramfs: Fix argument forwarding to panic() in panic_show_mem()Benjamin Gray1-9/+2
Forwarding variadic argument lists can't be done by passing a va_list to a function with signature foo(...) (as panic() has). It ends up interpreting the va_list itself as a single argument instead of iterating it. printf() happily accepts it of course, leading to corrupt output. Convert panic_show_mem() to a macro to allow forwarding the arguments. The function is trivial enough that it's easier than trying to introduce a vpanic() variant. Signed-off-by: Benjamin Gray <[email protected]> Reviewed-by: Andrew Donnellan <[email protected]> Signed-off-by: Masahiro Yamada <[email protected]>
2023-04-16initramfs: Check negative timestamp to prevent broken cpio archiveBenjamin Gray1-3/+9
Similar to commit 4c9d410f32b3 ("initramfs: Check timestamp to prevent broken cpio archive"), except asserts that the timestamp is non-negative. This can happen when the KBUILD_BUILD_TIMESTAMP is a value before UNIX epoch, which may be set when making reproducible builds that don't want to look like they use a valid date. While support for dates before 1970 might not be supported, this is more about preventing undetected CPIO corruption. The printf's use a minimum length format specifier, and will happily make the field longer than 8 characters if they need to. Signed-off-by: Benjamin Gray <[email protected]> Reviewed-by: Andrew Donnellan <[email protected]> Tested-by: Andrew Donnellan <[email protected]> Signed-off-by: Masahiro Yamada <[email protected]>
2023-04-15Merge tag '6.3-rc6-smb311-client-negcontext-fix' of ↵Linus Torvalds1-10/+31
git://git.samba.org/sfrench/cifs-2.6 Pull cifs fix from Steve French: "Small client fix for better checking for smb311 negotiate context overflows, also marked for stable" * tag '6.3-rc6-smb311-client-negcontext-fix' of git://git.samba.org/sfrench/cifs-2.6: cifs: fix negotiate context parsing
2023-04-15Merge tag 'ubifs-for-linus-6.3-rc7' of ↵Linus Torvalds2-8/+17
git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs Pull UBI fixes from Richard Weinberger: - Fix failure to attach when vid_hdr offset equals the (sub)page size - Fix for a deadlock in UBI's worker thread * tag 'ubifs-for-linus-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs: ubi: Fix failure attaching when vid_hdr offset equals to (sub)page size ubi: Fix deadlock caused by recursively holding work_sem
2023-04-15cifs: fix negotiate context parsingDavid Disseldorp1-10/+31
smb311_decode_neg_context() doesn't properly check against SMB packet boundaries prior to accessing individual negotiate context entries. This is due to the length check omitting the eight byte smb2_neg_context header, as well as incorrect decrementing of len_of_ctxts. Fixes: 5100d8a3fe03 ("SMB311: Improve checking of negotiate security contexts") Reported-by: Volker Lendecke <[email protected]> Reviewed-by: Paulo Alcantara (SUSE) <[email protected]> Signed-off-by: David Disseldorp <[email protected]> Signed-off-by: Steve French <[email protected]>
2023-04-15Merge tag 'i2c-for-6.3-rc7' of ↵Linus Torvalds2-46/+49
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c fixes from Wolfram Sang: "Just two driver fixes" * tag 'i2c-for-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: i2c: ocores: generate stop condition after timeout in polling mode i2c: mchp-pci1xxxx: Update Timing registers
2023-04-15Merge tag 'scsi-fixes' of ↵Linus Torvalds1-12/+8
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fix from James Bottomley: "One small fix to SCSI Enclosure Services to fix a regression caused by another recent fix" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: ses: Handle enclosure with just a primary component gracefully
2023-04-15Merge tag 'block-6.3-2023-04-14' of git://git.kernel.dk/linuxLinus Torvalds1-0/+2
Pull block fix from Jens Axboe: "A single NVMe quirk entry addition" * tag 'block-6.3-2023-04-14' of git://git.kernel.dk/linux: nvme-pci: add NVME_QUIRK_BOGUS_NID for T-FORCE Z330 SSD
2023-04-15Merge tag 'io_uring-6.3-2023-04-14' of git://git.kernel.dk/linuxLinus Torvalds1-1/+1
Pull io_uring fix from Jens Axboe: "Just a small tweak to when task_work needs redirection, marked for stable as well" * tag 'io_uring-6.3-2023-04-14' of git://git.kernel.dk/linux: io_uring: complete request via task work in case of DEFER_TASKRUN
2023-04-14Merge branch 'page_pool-allow-caching-from-safely-localized-napi'Jakub Kicinski8-30/+58
Jakub Kicinski says: ==================== page_pool: allow caching from safely localized NAPI I went back to the explicit "are we in NAPI method", mostly because I don't like having both around :( (even tho I maintain that in_softirq() && !in_hardirq() is as safe, as softirqs do not nest). Still returning the skbs to a CPU, tho, not to the NAPI instance. I reckon we could create a small refcounted struct per NAPI instance which would allow sockets and other users so hold a persisent and safe reference. But that's a bigger change, and I get 90+% recycling thru the cache with just these patches (for RR and streaming tests with 100% CPU use it's almost 100%). Some numbers for streaming test with 100% CPU use (from previous version, but really they perform the same): HW-GRO page=page before after before after recycle: cached: 0 138669686 0 150197505 cache_full: 0 223391 0 74582 ring: 138551933 9997191 149299454 0 ring_full: 0 488 3154 127590 released_refcnt: 0 0 0 0 alloc: fast: 136491361 148615710 146969587 150322859 slow: 1772 1799 144 105 slow_high_order: 0 0 0 0 empty: 1772 1799 144 105 refill: 2165245 156302 2332880 2128 waive: 0 0 0 0 v1: https://lore.kernel.org/all/[email protected]/ rfcv2: https://lore.kernel.org/all/[email protected]/ ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-04-14bnxt: hook NAPIs to page poolsJakub Kicinski1-0/+1
bnxt has 1:1 mapping of page pools and NAPIs, so it's safe to hoook them up together. Reviewed-by: Tariq Toukan <[email protected]> Tested-by: Dragos Tatulea <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2023-04-14page_pool: allow caching from safely localized NAPIJakub Kicinski7-12/+37
Recent patches to mlx5 mentioned a regression when moving from driver local page pool to only using the generic page pool code. Page pool has two recycling paths (1) direct one, which runs in safe NAPI context (basically consumer context, so producing can be lockless); and (2) via a ptr_ring, which takes a spin lock because the freeing can happen from any CPU; producer and consumer may run concurrently. Since the page pool code was added, Eric introduced a revised version of deferred skb freeing. TCP skbs are now usually returned to the CPU which allocated them, and freed in softirq context. This places the freeing (producing of pages back to the pool) enticingly close to the allocation (consumer). If we can prove that we're freeing in the same softirq context in which the consumer NAPI will run - lockless use of the cache is perfectly fine, no need for the lock. Let drivers link the page pool to a NAPI instance. If the NAPI instance is scheduled on the same CPU on which we're freeing - place the pages in the direct cache. With that and patched bnxt (XDP enabled to engage the page pool, sigh, bnxt really needs page pool work :() I see a 2.6% perf boost with a TCP stream test (app on a different physical core than softirq). The CPU use of relevant functions decreases as expected: page_pool_refill_alloc_cache 1.17% -> 0% _raw_spin_lock 2.41% -> 0.98% Only consider lockless path to be safe when NAPI is scheduled - in practice this should cover majority if not all of steady state workloads. It's usually the NAPI kicking in that causes the skb flush. The main case we'll miss out on is when application runs on the same CPU as NAPI. In that case we don't use the deferred skb free path. Reviewed-by: Tariq Toukan <[email protected]> Acked-by: Jesper Dangaard Brouer <[email protected]> Tested-by: Dragos Tatulea <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2023-04-14net: skb: plumb napi state thru skb freeing pathsJakub Kicinski1-18/+20
We maintain a NAPI-local cache of skbs which is fed by napi_consume_skb(). Going forward we will also try to cache head and data pages. Plumb the "are we in a normal NAPI context" information thru deeper into the freeing path, up to skb_release_data() and skb_free_head()/skb_pp_recycle(). The "not normal NAPI context" comes from netpoll which passes budget of 0 to try to reap the Tx completions but not perform any Rx. Use "bool napi_safe" rather than bare "int budget", the further we get from NAPI the more confusing the budget argument may seem (particularly whether 0 or MAX is the correct value to pass in when not in NAPI). Reviewed-by: Tariq Toukan <[email protected]> Tested-by: Dragos Tatulea <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2023-04-14net/mlx5: DR, Enable patterns and arguments for supporting devicesYevgeny Kliteynik1-1/+2
Check if patterns and arguments for modify header action are supported and enable them accordingly. Signed-off-by: Muhammad Sammar <[email protected]> Signed-off-by: Yevgeny Kliteynik <[email protected]> Reviewed-by: Alex Vesker <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2023-04-14net/mlx5: DR, Add support for the pattern/arg parameters in debug dumpYevgeny Kliteynik1-3/+27
Support the pattern/args-based MODIFY_HDR and TNL_L3_TO_L2 actions in dbg dump Signed-off-by: Yevgeny Kliteynik <[email protected]> Reviewed-by: Alex Vesker <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2023-04-14net/mlx5: DR, Modify header action of size 1 optimizationYevgeny Kliteynik3-25/+52
Set modify header action of size 1 directly on the STE for supporting devices, thus reducing number of hops and cache misses. Signed-off-by: Yevgeny Kliteynik <[email protected]> Reviewed-by: Alex Vesker <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2023-04-14net/mlx5: DR, Support decap L3 action using pattern / arg mechanismYevgeny Kliteynik1-16/+5
Use the new accelerated action for decap L3 on RX side: use the mechanism of pattern and argument same as in modify-header action. Signed-off-by: Erez Shitrit <[email protected]> Signed-off-by: Yevgeny Kliteynik <[email protected]> Reviewed-by: Alex Vesker <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2023-04-14net/mlx5: DR, Apply new accelerated modify action and decapl3Yevgeny Kliteynik2-6/+47
If there is support for pattern/args, use the new accelerated modify header action for modify header and decap L3 actions. Otherwise fall back to the old modify-header implementation. Signed-off-by: Yevgeny Kliteynik <[email protected]> Reviewed-by: Alex Vesker <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2023-04-14net/mlx5: DR, Add modify header argument pointer to actions attributesYevgeny Kliteynik3-7/+42
While building the actions, add the pointer of the arguments for accelerated modify list action into the action's attributes. This will be used later on while building the specific STE for this action. Signed-off-by: Yevgeny Kliteynik <[email protected]> Reviewed-by: Alex Vesker <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2023-04-14net/mlx5: DR, Add modify header arg pool mechanismYevgeny Kliteynik4-1/+304
Added new mechanism for handling arguments for modify-header action. The new action "accelerated modify-header" asks for the arguments from separated area from the pattern, this area accessed via general objects. Handling of these object is done via the pool-manager struct. When the new header patterns are supported, while loading the domain, a few pools for argument creations will be created. The requests for allocating/deallocating arg objects are done via the pool manager API. Signed-off-by: Muhammad Sammar <[email protected]> Signed-off-by: Yevgeny Kliteynik <[email protected]> Reviewed-by: Alex Vesker <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2023-04-14net/mlx5: DR, Fix QP continuous allocationYevgeny Kliteynik1-1/+1
When allocating a QP we allocate an RQ and an SQ, the RQ is stored first in memory and followed by the SQ. This allocation is not physically continiuos - it may span across different physical pages. SW Steering code always writes in pairs: 1BB write + 1BB read, or 2 continuous BBs of GTA WQE. This lead to an issue where RQ allocation was 4x16 which is equal to 1 WQE BB, causing 1 BB offset in the page and splitting the GTA WQE between different physical pages. The solution was to create the RQ with a even number of BBs and to have the RQ aligned to a page. Signed-off-by: Alex Vesker <[email protected]> Signed-off-by: Yevgeny Kliteynik <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2023-04-14net/mlx5: DR, Read ICM memory into dedicated bufferYevgeny Kliteynik2-9/+17
Instead of using the write buffer for reading we will use a dedicated buffer only for reading ICM memory. Due to the new support for args, we can have a case with pending_wc being odd number, and with reading into the same write buffer, it is possible to overwrite next write on the same slot. For example: pending_wc is 17 so the buffer for write is: | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | and we have requests as follows: r wr wr wr wr wr wr wr wr Now, the first read will be written into the last write because we use the same buffer for read and write, before it was written to the HW and we will have a wrong data in the ICM area. Signed-off-by: Erez Shitrit <[email protected]> Signed-off-by: Yevgeny Kliteynik <[email protected]> Reviewed-by: Alex Vesker <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2023-04-14net/mlx5: DR, Add support for writing modify header argumentYevgeny Kliteynik2-20/+150
The accelerated modify header arguments are written in the HW area with special WQE and specific data format. New function was added to support writing of new argument type. Note that GTA WQE is larger than READ and WRITE, so the queue management logic was updated to support this. Signed-off-by: Yevgeny Kliteynik <[email protected]> Reviewed-by: Alex Vesker <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>