|
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
mlx5-updates-2023-04-14
Yevgeny Kliteynik Says:
=======================
SW Steering: Support pattern/args modify_header actions
The following patch series adds support for a new pattern/arguments type
of modify_header actions.
Starting with ConnectX-6 DX, we use a new design of modify_header FW object.
The current modify_header object allows for only a limited number of
these FW objects, which means that we are limited in the number of offloaded
flows that require a modify_header action.
The new approach comprises two types of objects: pattern and argument.
A pattern holds header modification templates, later used with a corresponding
argument object to create complete header modification actions.
The pattern indicates which headers are modified, while the arguments
provide the specific values.
Therefore a single pattern can be used with different arguments in different
flows, enabling offloading of a large number of modify_header flows.
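The pattern/argument split can be pictured with a small model. This is only an illustrative sketch in Python, not the driver code: the names `PatternCache` and `build_action`, the field names, and the representation of a pattern as a tuple of field names are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical model: a "pattern" is the ordered tuple of header fields to
# rewrite; an "argument" is the tuple of values written into those fields.
Pattern = Tuple[str, ...]
Argument = Tuple[int, ...]


class PatternCache:
    """Toy cache that hands out one shared pattern id per distinct template."""

    def __init__(self) -> None:
        self._ids: Dict[Pattern, int] = {}

    def get(self, fields: Pattern) -> int:
        # Reuse the existing pattern if the same template was seen before.
        if fields not in self._ids:
            self._ids[fields] = len(self._ids)
        return self._ids[fields]

    def __len__(self) -> int:
        return len(self._ids)


def build_action(cache: PatternCache,
                 rewrites: List[Tuple[str, int]]) -> Tuple[int, Argument]:
    """Split a list of (field, value) rewrites into (pattern id, argument)."""
    fields = tuple(name for name, _ in rewrites)
    values = tuple(value for _, value in rewrites)
    return cache.get(fields), values


cache = PatternCache()
flow_a = build_action(cache, [("ipv4.ttl", 63), ("eth.smac", 0x0A0B0C0D0E0F)])
flow_b = build_action(cache, [("ipv4.ttl", 31), ("eth.smac", 0x010203040506)])
print(flow_a[0] == flow_b[0], len(cache))
```

Two flows that rewrite the same fields with different values end up sharing one pattern entry and differ only in their arguments, which is the property the series relies on.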
- Patch 1, 2: Add ICM pool for modify-header-pattern objects and implement
a patterns cache, allowing pattern reuse for different flows
- Patch 3: Allow for chunk allocation separately for STEv0 and STEv1
- Patch 4: Read related device capabilities
- Patch 5: Add create/destroy functions for the new general object type
- Patch 6: Add support for writing modify header argument to ICM
- Patch 7, 8: Some required fixes to support pattern/arg - separate read
buffer from the write buffer and fix QP continuous allocation
- Patch 9: Add pool for modify header arg objects
- Patch 10, 11, 12: Implement MODIFY_HEADER and TNL_L3_TO_L2 actions with
the new patterns/args design
- Patch 13: Optimization - set modify header action of size 1 directly on
the STE instead of separate pattern/args combination
- Patch 14: Adjust debug dump for patterns/args
- Patch 15: Enable patterns and arguments for supporting devices
=======================
|
|
Aaron Conole says:
====================
selftests: openvswitch: add support for testing upcall interface
The existing selftest suite for openvswitch will work for regression
testing the datapath feature bits, but won't test things like adding
interfaces, or the upcall interface. Here, we add some additional
test facilities.
First, extend the ovs-dpctl.py python module to support the OVS_FLOW
and OVS_PACKET netlink families, with some associated messages. These
can be extended over time, but the initial support is for more well
known cases (output, userspace, and CT).
Next, extend the test suite to test upcalls by adding a datapath,
monitoring the upcall socket associated with the datapath, and then
dumping any upcalls that are received. Compare with expected ARP
upcall via arping.
====================
Signed-off-by: David S. Miller <[email protected]>
|
|
The upcall socket interface can be exercised now to make sure that
future feature adjustments to the field can maintain backwards
compatibility.
Signed-off-by: Aaron Conole <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Add a basic set of fields to print in a 'dpflow' format. This will be
used by future commits to check for flow fields after parsing, as
well as verifying the flow fields pushed into the kernel from
userspace.
Signed-off-by: Aaron Conole <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Includes an associated test to generate netns and connect
interfaces, with the option to include packet tracing.
This will be used in the future when flow support is added
for additional test cases.
Signed-off-by: Aaron Conole <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
If the 1PPS output was enabled and then lan8841 was configured to be a
follower, then the target clock which is used to generate the 1PPS was not
configured correctly. The problem was that for each adjustment of the
time, the nanosecond part of the target clock was also changed.
Therefore the initial nanosecond part of the target clock was changed.
The issue can be observed if both the leader and the follower are
generating 1PPS: their PPS signals are not aligned even if the time
is aligned.
The fix consists of not modifying the nanosecond part of the target
clock when adjusting the time. In this way the 1PPS outputs also get aligned.
Fixes: e4ed8ba08e3f ("net: phy: micrel: Add support for PTP_PF_PEROUT for lan8841")
Signed-off-by: Horatiu Vultur <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Here we copy the data from the original buf to the new page, but we
do not check whether the copy may overflow.
If the size received (including vnethdr) is greater than 3840
(PAGE_SIZE - VIRTIO_XDP_HEADROOM), then the memcpy will overflow.
And this is completely possible, as long as the MTU is large, such
as 4096. In our test environment, this causes a crash. Since the crash is
caused by the overwritten memory, the trace is meaningless, so I do not include it.
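The missing check amounts to comparing the received length against the room left after the headroom. The following is a minimal Python sketch of that bound, not the driver fix itself: `copy_into_page` is a hypothetical helper, PAGE_SIZE is assumed to be 4096, and the headroom value 256 is simply what the 3840 figure above implies for that page size.

```python
PAGE_SIZE = 4096              # assumed page size; consistent with the 3840 figure
VIRTIO_XDP_HEADROOM = 256     # PAGE_SIZE - 3840, derived from the text above


def copy_into_page(page: bytearray, data: bytes) -> bool:
    """Copy data after the headroom, refusing copies that would not fit."""
    room = len(page) - VIRTIO_XDP_HEADROOM
    if len(data) > room:
        return False  # would write past the end of the page
    page[VIRTIO_XDP_HEADROOM:VIRTIO_XDP_HEADROOM + len(data)] = data
    return True


page = bytearray(PAGE_SIZE)
print(copy_into_page(page, bytes(3840)), copy_into_page(page, bytes(3841)))
```

A payload of exactly 3840 bytes still fits; one more byte is rejected instead of being copied past the page.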
Fixes: 72979a6c3590 ("virtio_net: xdp, add slowpath case for non contiguous buffers")
Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
|
|
This is a proposal to revert commit 914eedcb9ba0ff53c33808.
I found this when writing a simple UFFDIO_API test to be the first unit
test in this set. Two things break with the commit:
- The UFFDIO_API check was lost. According to the man page, the
kernel should reject ioctl(UFFDIO_API) if uffdio_api.api != 0xaa. This
check is needed if the api version will be extended in the future, or
user apps won't be able to identify which is a new kernel.
- Feature flag checks were removed, which means UFFDIO_API with a
feature that does not exist will also succeed. According to the man
page, we should (and it makes sense to) reject ioctl(UFFDIO_API) if
unknown features are passed in.
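The two checks described in the list above can be sketched as follows. This is a Python model for illustration only, not the kernel code: `check_uffdio_api` and the `KNOWN_FEATURES` mask are hypothetical, and returning -EINVAL for both cases is an assumption based on the man page behaviour the text refers to.

```python
import errno

UFFD_API = 0xAA                  # API version named in the text above
KNOWN_FEATURES = 0b0000_0111     # hypothetical mask of supported feature bits


def check_uffdio_api(api: int, features: int) -> int:
    """Return 0 if the handshake is acceptable, else a negative errno."""
    if api != UFFD_API:
        return -errno.EINVAL     # unknown API version
    if features & ~KNOWN_FEATURES:
        return -errno.EINVAL     # at least one unsupported feature bit
    return 0


print(check_uffdio_api(0xAA, 0b001),
      check_uffdio_api(0xAB, 0),
      check_uffdio_api(0xAA, 1 << 40))
```

With both checks in place, user space can probe for a feature by requesting it and treating a failure as "not supported by this kernel".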
Link: https://lore.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 914eedcb9ba0 ("userfaultfd: don't fail on unrecognized features")
Signed-off-by: Peter Xu <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: Dmitry Safonov <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Zach O'Keefe <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
KASAN report null-ptr-deref:
==================================================================
BUG: KASAN: null-ptr-deref in bdi_split_work_to_wbs+0x5c5/0x7b0
Write of size 8 at addr 0000000000000000 by task sync/943
CPU: 5 PID: 943 Comm: sync Tainted: 6.3.0-rc5-next-20230406-dirty #461
Call Trace:
<TASK>
dump_stack_lvl+0x7f/0xc0
print_report+0x2ba/0x340
kasan_report+0xc4/0x120
kasan_check_range+0x1b7/0x2e0
__kasan_check_write+0x24/0x40
bdi_split_work_to_wbs+0x5c5/0x7b0
sync_inodes_sb+0x195/0x630
sync_inodes_one_sb+0x3a/0x50
iterate_supers+0x106/0x1b0
ksys_sync+0x98/0x160
[...]
==================================================================
The race that causes the above issue is as follows:
          cpu1                     cpu2
-------------------------|-------------------------
inode_switch_wbs
 INIT_WORK(&isw->work, inode_switch_wbs_work_fn)
 queue_rcu_work(isw_wq, &isw->work)
 // queue_work async
  inode_switch_wbs_work_fn
   wb_put_many(old_wb, nr_switched)
    percpu_ref_put_many
     ref->data->release(ref)
     cgwb_release
      queue_work(cgwb_release_wq, &wb->release_work)
      // queue_work async
       &wb->release_work
       cgwb_release_workfn
                           ksys_sync
                            iterate_supers
                             sync_inodes_one_sb
                              sync_inodes_sb
                               bdi_split_work_to_wbs
                                kmalloc(sizeof(*work), GFP_ATOMIC)
                                // alloc memory failed
        percpu_ref_exit
         ref->data = NULL
         kfree(data)
                                wb_get(wb)
                                 percpu_ref_get(&wb->refcnt)
                                  percpu_ref_get_many(ref, 1)
                                   atomic_long_add(nr, &ref->data->count)
                                    atomic64_add(i, v)
                                    // trigger null-ptr-deref
bdi_split_work_to_wbs() traverses &bdi->wb_list to split work into all
wbs. If the allocation of new work fails, the on-stack fallback will be
used and the reference count of the current wb is increased afterwards.
If cgroup writeback membership switches occur before getting the reference
count and the current wb is released as old_wb, then calling wb_get() or
wb_put() will trigger the null pointer dereference above.
This issue was introduced in v4.3-rc7 (see fix tag1). Both
sync_inodes_sb() and __writeback_inodes_sb_nr() calls to
bdi_split_work_to_wbs() can trigger this issue. For scenarios called via
sync_inodes_sb(), originally commit 7fc5854f8c6e ("writeback: synchronize
sync(2) against cgroup writeback membership switches") reduced the
possibility of the issue by adding wb_switch_rwsem, but a change in v5.14-rc1
(see fix tag2) removed the "inode_io_list_del_locked(inode, old_wb)" from
inode_switch_wbs_work_fn() so that wb->state contains WB_has_dirty_io,
thus old_wb is not skipped when traversing wbs in bdi_split_work_to_wbs(),
and the issue becomes easily reproducible again.
To solve this problem, percpu_ref_exit() is called under RCU protection to
avoid the race between cgwb_release_workfn() and bdi_split_work_to_wbs().
Moreover, replace wb_get() with wb_tryget() in bdi_split_work_to_wbs(),
and skip the current wb if wb_tryget() fails because the wb has already
been shut down.
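The "try to take a reference, skip on failure" idea can be shown with a toy counter. This is a single-threaded Python sketch and deliberately does not model the RCU part or the actual race; `Ref` and `split_work` are hypothetical names and are not the kernel's percpu_ref or bdi_split_work_to_wbs().

```python
class Ref:
    """Toy reference counter; not the kernel's percpu_ref."""

    def __init__(self) -> None:
        self.count = 1          # initial reference held by the owner
        self.dead = False

    def tryget(self) -> bool:
        # Refuse to hand out references once the object is being torn down.
        if self.dead or self.count == 0:
            return False
        self.count += 1
        return True

    def put(self) -> None:
        self.count -= 1
        if self.count == 0:
            self.dead = True    # release would run here


def split_work(wbs):
    """Return the names of the wbs that could still be pinned."""
    pinned = []
    for name, ref in wbs:
        if not ref.tryget():
            continue            # wb already shut down: skip it
        pinned.append(name)
        ref.put()               # drop the temporary reference
    return pinned


live, dying = Ref(), Ref()
dying.put()                     # last reference dropped, wb is gone
print(split_work([("live", live), ("dying", dying)]))
```

An unconditional get on the dying entry would touch state that has already been released; the conditional form turns that case into a skipped iteration.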
Link: https://lkml.kernel.org/r/[email protected]
Fixes: b817525a4a80 ("writeback: bdi_writeback iteration must not skip dying ones")
Signed-off-by: Baokun Li <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Andreas Dilger <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Hou Tao <[email protected]>
Cc: yangerkun <[email protected]>
Cc: Zhang Yi <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
In mas_alloc_nodes(), "node->node_count = 0" is meant to initialize the
node_count field of a new node, but the node may not be new. It
may be a node that existed before and whose node_count has a value; setting it
to 0 will cause a memory leak. At this point, mas->alloc->total will be
greater than the actual number of nodes in the linked list, which may
cause many other errors, for example out-of-bounds access in
mas_pop_node(), and mas_pop_node() may return addresses that should not be
used. Fix it by initializing node_count only for new nodes.
Also remove an if-else statement to simplify the code.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Peng Zhang <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
When using the cull option with the 'tg' flag, fprintf prints pid instead
of tgid. It should print tgid.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 9c8a0a8e599f4a ("tools/vm/page_owner_sort.c: support for user-defined culling rules")
Signed-off-by: Steve Chou <[email protected]>
Cc: Jiajian Ye <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Link: https://lkml.kernel.org/r/d79bc6eaf65e68bd1c2a1e1510ab6291ce5926a6.1681162487.git.jtoppins@redhat.com
Signed-off-by: Jonathan Toppins <[email protected]>
Cc: Colin Ian King <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Konrad Dybcio <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Stephen Hemminger <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
set_mempolicy_home_node() iterates over a list of VMAs and calls
mbind_range() on each VMA, which also iterates over the singular list of
the VMA passed in and potentially splits the VMA. Since the VMA iterator
is not passed through, set_mempolicy_home_node() may now point to a stale
node in the VMA tree. This can result in a UAF as reported by syzbot.
Avoid the stale maple tree node by passing the VMA iterator through to the
underlying call to split_vma().
mbind_range() is also overly complicated, since there are two calling
functions and one already handles iterating over the VMAs. Simplify
mbind_range() to only handle merging and splitting of the VMAs.
Align the new loop in do_mbind() and existing loop in
set_mempolicy_home_node() to use the reduced mbind_range() function. This
allows for a single location of the range calculation and avoids
constantly looking up the previous VMA (since this is a loop over the
VMAs).
Link: https://lore.kernel.org/linux-mm/[email protected]/
Fixes: 66850be55e8e ("mm/mempolicy: use vma iterator & maple state instead of vma linked list")
Signed-off-by: Liam R. Howlett <[email protected]>
Reported-by: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
Tested-by: [email protected]
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
split_huge_page_to_list() WARNs when called for huge zero pages, which
sounds to me too harsh because it does not imply a kernel bug, but just
notifies the event to admins. On the other hand, this is considered as
critical by syzkaller and makes its testing less efficient, which seems to
me harmful.
So replace the VM_WARN_ON_ONCE_FOLIO with pr_warn_ratelimited.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 478d134e9506 ("mm/huge_memory: do not overkill when splitting huge_zero_page")
Signed-off-by: Naoya Horiguchi <[email protected]>
Reported-by: [email protected]
Link: https://lore.kernel.org/lkml/[email protected]/
Reviewed-by: Yang Shi <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Xu Yu <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
When the loop over the VMA is terminated early due to an error, the return
code could be overwritten with ENOMEM. Fix the return code by only
setting the error on early loop termination when the error is not set.
User-visible effects include: attempts to run mprotect() against a
special mapping or with a poorly-aligned hugetlb address should return
-EINVAL, but they presently return -ENOMEM. In other cases an -EACCES
should be returned.
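The return-code rule can be sketched in a few lines. This is a simplified Python model, not the mprotect() code: `walk` is a hypothetical helper and each VMA is reduced to an (end address, per-step error) pair.

```python
import errno


def walk(vmas, end):
    """Apply a per-VMA step; return 0 or the first negative errno seen."""
    error = 0
    reached = 0
    for vma_end, step_error in vmas:
        if step_error:
            error = step_error   # remember the real failure
            break
        reached = vma_end
    # Only report -ENOMEM for an incomplete walk when nothing else failed.
    if reached < end and not error:
        error = -errno.ENOMEM
    return error


print(walk([(10, 0), (20, -errno.EINVAL)], 30),
      walk([(10, 0)], 30),
      walk([(10, 0), (30, 0)], 30))
```

The first call keeps -EINVAL from the failing step, the second reports -ENOMEM because the range was not fully covered, and the third succeeds.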
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 2286a6914c77 ("mm: change mprotect_fixup to vma iterator")
Signed-off-by: Liam R. Howlett <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Khugepaged collapses an anonymous thp in two rounds of scans. The 2nd
round is done in __collapse_huge_page_isolate() after
hpage_collapse_scan_pmd(), during which all the locks will be released
temporarily. It means the pgtable can change during this phase before the 2nd
round starts.
It's logically possible that some ptes got wr-protected during this phase, and
we can erroneously collapse a thp without noticing some ptes are
wr-protected by userfault. e1e267c7928f wanted to avoid it but it only
did that for the 1st phase, not the 2nd phase.
Since __collapse_huge_page_isolate() happens after a round of small page
swapins, we don't need to worry about any !present ptes - if any existed,
khugepaged would already have bailed out. So we only need to check present ptes
with the uffd-wp bit set there.
This is something I only found but never had a reproducer for. I thought it
was what caused a bug in Muhammad's recent pagemap new ioctl work, but it
turns out that was caused by a userspace bug instead. However, this
still seems to be a real bug even with a very small race window, so it is
still worth fixing and copying stable.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: e1e267c7928f ("khugepaged: skip collapse if uffd-wp detected")
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Looks like what we fixed for hugetlb in commit 44f86392bdd1 ("mm/hugetlb:
fix uffd-wp handling for migration entries in
hugetlb_change_protection()") similarly applies to THP.
Setting/clearing uffd-wp on THP migration entries is not implemented
properly. Further, while removing migration PMDs considers the uffd-wp
bit, inserting migration PMDs does not consider the uffd-wp bit.
We have to set/clear independently of the migration entry type in
change_huge_pmd() and properly copy the uffd-wp bit in
set_pmd_migration_entry().
Verified using a simple reproducer that triggers migration of a THP, that
the set_pmd_migration_entry() no longer loses the uffd-wp bit.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration")
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Peter Xu <[email protected]>
Cc: <[email protected]>
Cc: Muhammad Usama Anjum <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The ->percpu_pvec_drained was originally introduced by commit d9ed0d08b6c6
("mm: only drain per-cpu pagevecs once per pagevec usage") to drain
per-cpu pagevecs only once per pagevec usage. But after converting the
swap code to be more folio-based, the commit c2bc16817aa0 ("mm/swap: add
folio_batch_move_lru()") breaks this logic, which would cause
->percpu_pvec_drained to be reset to false, that means per-cpu pagevecs
will be drained multiple times per pagevec usage.
In theory, there should be no functional changes when converting code to
be more folio-based. We should call folio_batch_reinit() in
folio_batch_move_lru() instead of folio_batch_init(). And to verify that
we still need ->percpu_pvec_drained, I ran mmtests/sparsetruncate-tiny and
got the following data:
                          baseline                 with
                         baseline/               patch/
Min       Time      326.00 (   0.00%)      328.00 (  -0.61%)
1st-qrtle Time      334.00 (   0.00%)      336.00 (  -0.60%)
2nd-qrtle Time      338.00 (   0.00%)      341.00 (  -0.89%)
3rd-qrtle Time      343.00 (   0.00%)      347.00 (  -1.17%)
Max-1     Time      326.00 (   0.00%)      328.00 (  -0.61%)
Max-5     Time      327.00 (   0.00%)      330.00 (  -0.92%)
Max-10    Time      328.00 (   0.00%)      331.00 (  -0.91%)
Max-90    Time      350.00 (   0.00%)      357.00 (  -2.00%)
Max-95    Time      395.00 (   0.00%)      390.00 (   1.27%)
Max-99    Time      508.00 (   0.00%)      434.00 (  14.57%)
Max       Time      547.00 (   0.00%)      476.00 (  12.98%)
Amean     Time      344.61 (   0.00%)      345.56 *  -0.28%*
Stddev    Time       30.34 (   0.00%)       19.51 (  35.69%)
CoeffVar  Time        8.81 (   0.00%)        5.65 (  35.87%)
BAmean-99 Time      342.38 (   0.00%)      344.27 (  -0.55%)
BAmean-95 Time      338.58 (   0.00%)      341.87 (  -0.97%)
BAmean-90 Time      336.89 (   0.00%)      340.26 (  -1.00%)
BAmean-75 Time      335.18 (   0.00%)      338.40 (  -0.96%)
BAmean-50 Time      332.54 (   0.00%)      335.42 (  -0.87%)
BAmean-25 Time      329.30 (   0.00%)      332.00 (  -0.82%)
From the above it can be seen that we get similar data to when
->percpu_pvec_drained was introduced, so we still need it. Let's call
folio_batch_reinit() in folio_batch_move_lru() to restore the original
logic.
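The difference between the two helpers can be illustrated with a toy model. This Python sketch is an assumption-laden simplification, not the kernel's folio_batch code: `Batch`, its `drained` flag and `drains_needed` are hypothetical stand-ins for the batch, ->percpu_pvec_drained and the caller's drain logic.

```python
class Batch:
    """Toy stand-in for a folio batch with a 'drained' flag."""

    def __init__(self) -> None:
        self.items = []
        self.drained = False

    def init(self) -> None:
        # Full initialisation: also forgets that a drain already happened.
        self.items = []
        self.drained = False

    def reinit(self) -> None:
        # Empty the batch but keep the 'drained' flag.
        self.items = []


def drains_needed(reset, rounds):
    """Count drains when the batch is recycled `rounds` times via `reset`."""
    batch, drains = Batch(), 0
    for _ in range(rounds):
        if not batch.drained:
            drains += 1
            batch.drained = True
        reset(batch)
    return drains


print(drains_needed(Batch.init, 4), drains_needed(Batch.reinit, 4))
```

Resetting with the full init makes every round drain again, while the reinit variant drains once for the lifetime of the batch.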
Link: https://lkml.kernel.org/r/[email protected]
Fixes: c2bc16817aa0 ("mm/swap: add folio_batch_move_lru()")
Signed-off-by: Qi Zheng <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Borislav Petkov:
- Do not pull tasks to the local scheduling group if its average load
is higher than the average system load
* tag 'sched_urgent_for_v6.3_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix imbalance overflow
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Borislav Petkov:
- Drop __init annotation from two rtc functions which get called after
boot is done, in order to prevent a crash
* tag 'x86_urgent_for_v6.3_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/rtc: Remove __init for runtime functions
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fix from Michael Ellerman:
- A fix for NUMA distance handling in the pseries SCM (pmem) driver.
Thanks to Aneesh Kumar K.V.
* tag 'powerpc-6.3-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/papr_scm: Update the NUMA distance table for the target node
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Drop debug info from purgatory objects again
- Document that kernel.org provides prebuilt LLVM toolchains
- Give up handling untracked files for source package builds
- Avoid creating corrupted cpio when KBUILD_BUILD_TIMESTAMP is given
with a pre-epoch date.
- Change panic_show_mem() to a macro to handle variable-length argument
- Compress tarballs on-the-fly again
* tag 'kbuild-fixes-v6.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kbuild: do not create intermediate *.tar for tar packages
kbuild: do not create intermediate *.tar for source tarballs
kbuild: merge cmd_archive_linux and cmd_archive_perf
init/initramfs: Fix argument forwarding to panic() in panic_show_mem()
initramfs: Check negative timestamp to prevent broken cpio archive
kbuild: give up untracked files for source package builds
Documentation/llvm: Add a note about prebuilt kernel.org toolchains
purgatory: fix disabling debug info
|
|
Pull ksmbd server fix from Steve French:
"smb311 server preauth integrity negotiate context parsing fix (check
for out of bounds access)"
* tag '6.3-rc6-ksmbd-server-fix' of git://git.samba.org/ksmbd:
ksmbd: avoid out of bounds access in decode_preauth_ctxt()
|
|
Commit 05e96e96a315 ("kbuild: use git-archive for source package
creation") split the compression as a separate step to factor out
the common build rules.
With the previous commit, we got back to the situation where source
tarballs are compressed on-the-fly.
There is no reason to keep the separate compression rules.
Generate the compressed tar packages directly.
Signed-off-by: Masahiro Yamada <[email protected]>
Reviewed-by: Nathan Chancellor <[email protected]>
|
|
Since commit 05e96e96a315 ("kbuild: use git-archive for source package
creation"), a source tarball is created in two steps; create *.tar file
then compress it. I split the compression as a separate rule because I
just thought 'git archive' supported only gzip.
For other compression algorithms, I could pipe the two commands:
$ git archive HEAD | xz > linux.tar.xz
I read git-archive(1) carefully, and I realized GIT had provided a
more elegant way:
$ git -c tar.tar.xz.command=xz archive -o linux.tar.xz HEAD
This commit uses 'tar.tar.*.command' configuration to specify the
compression backend so we can compress a source tarball on-the-fly.
GIT commit 767cf4579f0e ("archive: implement configurable tar filters")
is more than a decade old, so it should be available on almost all build
environments.
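For scripting, the same invocation can be assembled programmatically. The helper below is a hypothetical Python sketch (the name `archive_argv` and its parameters are assumptions); it only builds the argument list for the command quoted above and does not run git.

```python
import shlex


def archive_argv(fmt, compressor, output, tree="HEAD"):
    """Build a 'git archive' command line with a per-format tar filter."""
    # tar.<format>.command names the filter used for the given
    # archive format, as in the command quoted above.
    return [
        "git", "-c", f"tar.{fmt}.command={compressor}",
        "archive", "-o", output, tree,
    ]


argv = archive_argv("tar.xz", "xz", "linux.tar.xz")
print(shlex.join(argv))
```

The resulting list could be handed to `subprocess.run(argv, check=True)` inside a git work tree; other compressors would only change the `fmt`, `compressor` and `output` arguments.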
Signed-off-by: Masahiro Yamada <[email protected]>
Reviewed-by: Nathan Chancellor <[email protected]>
|
|
The two commands, cmd_archive_linux and cmd_archive_perf, are similar.
Merge them to make it easier to add more changes to the git-archive
command.
Signed-off-by: Masahiro Yamada <[email protected]>
Reviewed-by: Nathan Chancellor <[email protected]>
|
|
Forwarding variadic argument lists can't be done by passing a va_list
to a function with signature foo(...) (as panic() has). It ends up
interpreting the va_list itself as a single argument instead of
iterating it. printf() happily accepts it of course, leading to corrupt
output.
Convert panic_show_mem() to a macro to allow forwarding the arguments.
The function is trivial enough that it's easier than trying to introduce
a vpanic() variant.
Signed-off-by: Benjamin Gray <[email protected]>
Reviewed-by: Andrew Donnellan <[email protected]>
Signed-off-by: Masahiro Yamada <[email protected]>
|
|
Similar to commit 4c9d410f32b3 ("initramfs: Check timestamp to prevent
broken cpio archive"), except asserts that the timestamp is
non-negative. This can happen when the KBUILD_BUILD_TIMESTAMP is a value
before UNIX epoch, which may be set when making reproducible builds that
don't want to look like they use a valid date.
While dates before 1970 might not be supported, this is more
about preventing undetected CPIO corruption. The printf calls use a minimum
length format specifier, and will happily make the field longer than 8
characters if they need to.
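The field-width problem is easy to reproduce outside of C. The Python sketch below mimics a minimum-width hex conversion under the assumption of a 64-bit `long`; `hex_field` and `timestamp_ok` are hypothetical helpers, not functions from the initramfs tooling.

```python
FIELD_WIDTH = 8                     # hex digits per header field (from the text)
MAX_FIELD = (1 << (4 * FIELD_WIDTH)) - 1


def hex_field(value: int) -> str:
    """Mimic a C '%08lX' of a 64-bit value (assumption: 64-bit long)."""
    return "%08X" % (value & 0xFFFFFFFFFFFFFFFF)


def timestamp_ok(mtime: int) -> bool:
    """Accept only timestamps whose field is exactly FIELD_WIDTH digits."""
    return 0 <= mtime <= MAX_FIELD


for ts in (0, 1681430400, -1, 1 << 32):
    print(ts, hex_field(ts), len(hex_field(ts)), timestamp_ok(ts))
```

A negative value, or one above 32 bits, produces more than 8 digits and shifts every following field in the header, which is why rejecting it up front is safer than emitting it.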
Signed-off-by: Benjamin Gray <[email protected]>
Reviewed-by: Andrew Donnellan <[email protected]>
Tested-by: Andrew Donnellan <[email protected]>
Signed-off-by: Masahiro Yamada <[email protected]>
|
|
git://git.samba.org/sfrench/cifs-2.6
Pull cifs fix from Steve French:
"Small client fix for better checking for smb311 negotiate context
overflows, also marked for stable"
* tag '6.3-rc6-smb311-client-negcontext-fix' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix negotiate context parsing
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull UBI fixes from Richard Weinberger:
- Fix failure to attach when vid_hdr offset equals the (sub)page size
- Fix for a deadlock in UBI's worker thread
* tag 'ubifs-for-linus-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
ubi: Fix failure attaching when vid_hdr offset equals to (sub)page size
ubi: Fix deadlock caused by recursively holding work_sem
|
|
smb311_decode_neg_context() doesn't properly check against SMB packet
boundaries prior to accessing individual negotiate context entries. This
is due to the length check omitting the eight byte smb2_neg_context
header, as well as incorrect decrementing of len_of_ctxts.
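A bounds-checked walk over such contexts could look like the sketch below. It is an illustrative Python model, not the cifs code: `parse_contexts` is a hypothetical function, the 8-byte header layout (type, data length, reserved) follows the size given above, and aligning subsequent contexts to 8 bytes relative to the start of the buffer is an assumption.

```python
import struct

NEG_CTX_HDR = struct.Struct("<HHI")   # ContextType, DataLength, Reserved (8 bytes)


def parse_contexts(buf: bytes, count: int):
    """Return [(type, data)] or raise ValueError on a truncated packet."""
    out, off = [], 0
    for i in range(count):
        # The fixed header itself must fit in what is left of the packet.
        if len(buf) - off < NEG_CTX_HDR.size:
            raise ValueError("truncated context header")
        ctx_type, data_len, _ = NEG_CTX_HDR.unpack_from(buf, off)
        off += NEG_CTX_HDR.size
        # The payload must fit as well.
        if len(buf) - off < data_len:
            raise ValueError("truncated context data")
        out.append((ctx_type, buf[off:off + data_len]))
        off += data_len
        # Contexts other than the last are padded to an 8-byte boundary.
        if i + 1 < count:
            off = (off + 7) & ~7
    return out


good = (NEG_CTX_HDR.pack(1, 4, 0) + b"abcd" + bytes(4)
        + NEG_CTX_HDR.pack(2, 2, 0) + b"xy")
print(parse_contexts(good, 2))
```

Both the header and the payload are checked against the remaining bytes before they are read, so a packet that ends early raises instead of reading out of bounds.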
Fixes: 5100d8a3fe03 ("SMB311: Improve checking of negotiate security contexts")
Reported-by: Volker Lendecke <[email protected]>
Reviewed-by: Paulo Alcantara (SUSE) <[email protected]>
Signed-off-by: David Disseldorp <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"Just two driver fixes"
* tag 'i2c-for-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: ocores: generate stop condition after timeout in polling mode
i2c: mchp-pci1xxxx: Update Timing registers
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fix from James Bottomley:
"One small fix to SCSI Enclosure Services to fix a regression caused by
another recent fix"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ses: Handle enclosure with just a primary component gracefully
|
|
Pull block fix from Jens Axboe:
"A single NVMe quirk entry addition"
* tag 'block-6.3-2023-04-14' of git://git.kernel.dk/linux:
nvme-pci: add NVME_QUIRK_BOGUS_NID for T-FORCE Z330 SSD
|
|
Pull io_uring fix from Jens Axboe:
"Just a small tweak to when task_work needs redirection, marked for
stable as well"
* tag 'io_uring-6.3-2023-04-14' of git://git.kernel.dk/linux:
io_uring: complete request via task work in case of DEFER_TASKRUN
|
|
Jakub Kicinski says:
====================
page_pool: allow caching from safely localized NAPI
I went back to the explicit "are we in NAPI method", mostly
because I don't like having both around :( (even tho I maintain
that in_softirq() && !in_hardirq() is as safe, as softirqs do
not nest).
Still returning the skbs to a CPU, tho, not to the NAPI instance.
I reckon we could create a small refcounted struct per NAPI instance
which would allow sockets and other users to hold a persistent
and safe reference. But that's a bigger change, and I get 90+%
recycling thru the cache with just these patches (for RR and
streaming tests with 100% CPU use it's almost 100%).
Some numbers for streaming test with 100% CPU use (from previous version,
but really they perform the same):
                          HW-GRO                page=page
                       before       after      before       after
recycle:
cached:                     0   138669686           0   150197505
cache_full:                 0      223391           0       74582
ring:               138551933     9997191   149299454           0
ring_full:                  0         488        3154      127590
released_refcnt:            0           0           0           0
alloc:
fast:               136491361   148615710   146969587   150322859
slow:                    1772        1799         144         105
slow_high_order:            0           0           0           0
empty:                   1772        1799         144         105
refill:               2165245      156302     2332880        2128
waive:                      0           0           0           0
v1: https://lore.kernel.org/all/[email protected]/
rfcv2: https://lore.kernel.org/all/[email protected]/
====================
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
bnxt has a 1:1 mapping of page pools and NAPIs, so it's safe
to hook them up together.
Reviewed-by: Tariq Toukan <[email protected]>
Tested-by: Dragos Tatulea <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Recent patches to mlx5 mentioned a regression when moving from
driver local page pool to only using the generic page pool code.
Page pool has two recycling paths (1) direct one, which runs in
safe NAPI context (basically consumer context, so producing
can be lockless); and (2) via a ptr_ring, which takes a spin
lock because the freeing can happen from any CPU; producer
and consumer may run concurrently.
Since the page pool code was added, Eric introduced a revised version
of deferred skb freeing. TCP skbs are now usually returned to the CPU
which allocated them, and freed in softirq context. This places the
freeing (producing of pages back to the pool) enticingly close to
the allocation (consumer).
If we can prove that we're freeing in the same softirq context in which
the consumer NAPI will run - lockless use of the cache is perfectly fine,
no need for the lock.
Let drivers link the page pool to a NAPI instance. If the NAPI instance
is scheduled on the same CPU on which we're freeing - place the pages
in the direct cache.
With that and patched bnxt (XDP enabled to engage the page pool, sigh,
bnxt really needs page pool work :() I see a 2.6% perf boost with
a TCP stream test (app on a different physical core than softirq).
The CPU use of relevant functions decreases as expected:
page_pool_refill_alloc_cache 1.17% -> 0%
_raw_spin_lock 2.41% -> 0.98%
Only consider the lockless path to be safe when NAPI is scheduled
- in practice this should cover the majority if not all of steady state
workloads. It's usually the NAPI kicking in that causes the skb flush.
The main case we'll miss out on is when the application runs on the same
CPU as NAPI. In that case we don't use the deferred skb free path.
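The decision described above boils down to three conditions. The following Python sketch models it with hypothetical types; `Napi.owner_cpu`, `Pool.napi` and `allow_direct` are names invented for the example and should not be read as the page pool API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Napi:
    owner_cpu: Optional[int]     # CPU currently running this NAPI, None if idle


@dataclass
class Pool:
    napi: Optional[Napi] = None  # set by drivers that link pool and NAPI


def allow_direct(pool: Pool, napi_safe: bool, this_cpu: int) -> bool:
    """Decide whether a freed page may go to the lockless direct cache."""
    napi = pool.napi
    return (
        napi_safe                         # caller is in normal NAPI context
        and napi is not None              # driver linked the pool to a NAPI
        and napi.owner_cpu == this_cpu    # that NAPI is scheduled on our CPU
    )


linked = Pool(napi=Napi(owner_cpu=3))
print(allow_direct(linked, True, 3),
      allow_direct(linked, True, 5),
      allow_direct(Pool(), True, 3))
```

Any case that fails one of the conditions falls back to the ptr_ring path, so the shortcut only changes where the page is queued, never whether it is recycled.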
Reviewed-by: Tariq Toukan <[email protected]>
Acked-by: Jesper Dangaard Brouer <[email protected]>
Tested-by: Dragos Tatulea <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
We maintain a NAPI-local cache of skbs which is fed by napi_consume_skb().
Going forward we will also try to cache head and data pages.
Plumb the "are we in a normal NAPI context" information thru
deeper into the freeing path, up to skb_release_data() and
skb_free_head()/skb_pp_recycle(). The "not normal NAPI context"
comes from netpoll which passes budget of 0 to try to reap
the Tx completions but not perform any Rx.
Use "bool napi_safe" rather than bare "int budget",
the further we get from NAPI the more confusing the budget
argument may seem (particularly whether 0 or MAX is the
correct value to pass in when not in NAPI).
Reviewed-by: Tariq Toukan <[email protected]>
Tested-by: Dragos Tatulea <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Check if patterns and arguments for modify header action
are supported and enable them accordingly.
Signed-off-by: Muhammad Sammar <[email protected]>
Signed-off-by: Yevgeny Kliteynik <[email protected]>
Reviewed-by: Alex Vesker <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
Support the pattern/args-based MODIFY_HDR and TNL_L3_TO_L2 actions in dbg dump
Signed-off-by: Yevgeny Kliteynik <[email protected]>
Reviewed-by: Alex Vesker <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
Set modify header action of size 1 directly on the STE for supporting
devices, thus reducing the number of hops and cache misses.
Signed-off-by: Yevgeny Kliteynik <[email protected]>
Reviewed-by: Alex Vesker <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
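
Together with the fallback rule from the pattern/args commit further down
this log, the choice of implementation can be sketched roughly as below.
This is an illustrative userspace sketch, not the driver code: the enum,
choose_impl() and the single "ptrn_arg_supported" flag are assumptions
(the real driver may gate the size-1 case on a separate capability).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the three ways a modify-header can be applied. */
enum modify_hdr_impl {
	IMPL_LEGACY,        /* old modify-header object */
	IMPL_INLINE_ON_STE, /* single action written directly on the STE */
	IMPL_PATTERN_ARG,   /* pattern + argument objects */
};

/*
 * Pick an implementation from the device capability and the number of
 * header-modification actions. Assumes one flag covers both the
 * pattern/args path and the inline size-1 path.
 */
static enum modify_hdr_impl choose_impl(bool ptrn_arg_supported,
					size_t num_actions)
{
	assert(num_actions > 0);

	if (!ptrn_arg_supported)
		return IMPL_LEGACY;
	if (num_actions == 1)
		return IMPL_INLINE_ON_STE;
	return IMPL_PATTERN_ARG;
}
```

The single-action case skips the pattern and argument lookups entirely,
which is where the saved hops and cache misses would come from.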
|
|
Use the new accelerated action for decap L3 on the RX side,
using the same pattern and argument mechanism as the
modify-header action.
Signed-off-by: Erez Shitrit <[email protected]>
Signed-off-by: Yevgeny Kliteynik <[email protected]>
Reviewed-by: Alex Vesker <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
If there is support for pattern/args, use the new accelerated modify
header action for modify header and decap L3 actions.
Otherwise fall back to the old modify-header implementation.
Signed-off-by: Yevgeny Kliteynik <[email protected]>
Reviewed-by: Alex Vesker <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
While building the actions, add the pointer of the arguments for
accelerated modify list action into the action's attributes.
This will be used later on while building the specific STE
for this action.
Signed-off-by: Yevgeny Kliteynik <[email protected]>
Reviewed-by: Alex Vesker <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
Add a new mechanism for handling arguments for the modify-header action.
The new "accelerated modify-header" action takes its arguments from an
area separate from the pattern; this area is accessed via general objects.
Handling of these objects is done via the pool-manager struct.
When the new header patterns are supported, a few pools for argument
creation are created while loading the domain. Requests for
allocating/deallocating arg objects go through the pool manager API.
Signed-off-by: Muhammad Sammar <[email protected]>
Signed-off-by: Yevgeny Kliteynik <[email protected]>
Reviewed-by: Alex Vesker <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
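
The pool-manager idea from this commit can be sketched as a set of
per-size free lists of object IDs. This is a minimal userspace sketch
under stated assumptions: the names (arg_pool, arg_pool_mgr, arg_get(),
arg_put()), the fixed capacity and the three size classes are all
hypothetical, and real argument objects would be created through
firmware commands rather than numbered locally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARG_POOL_CAPACITY    4 /* tiny on purpose, for illustration */
#define ARG_NUM_SIZE_CLASSES 3 /* assumed number of pools */
#define ARG_INVALID_ID       UINT32_MAX

/* One pool: a stack of free object IDs for a single size class. */
struct arg_pool {
	uint32_t free_ids[ARG_POOL_CAPACITY];
	size_t num_free;
};

/* The manager owns one pool per size class. */
struct arg_pool_mgr {
	struct arg_pool pools[ARG_NUM_SIZE_CLASSES];
};

/* Pre-populate every pool, as would happen when the domain is loaded. */
static void arg_pool_mgr_init(struct arg_pool_mgr *mgr)
{
	assert(mgr != NULL);
	for (size_t c = 0; c < ARG_NUM_SIZE_CLASSES; c++) {
		struct arg_pool *pool = &mgr->pools[c];

		pool->num_free = ARG_POOL_CAPACITY;
		for (size_t i = 0; i < ARG_POOL_CAPACITY; i++)
			pool->free_ids[i] = (uint32_t)(c * ARG_POOL_CAPACITY + i);
	}
}

/* Take an object ID from the pool of the given size class. */
static uint32_t arg_get(struct arg_pool_mgr *mgr, size_t size_class)
{
	struct arg_pool *pool;

	assert(mgr != NULL && size_class < ARG_NUM_SIZE_CLASSES);
	pool = &mgr->pools[size_class];
	if (pool->num_free == 0)
		return ARG_INVALID_ID; /* pool exhausted */
	return pool->free_ids[--pool->num_free];
}

/* Return an object ID to its pool so another flow can reuse it. */
static void arg_put(struct arg_pool_mgr *mgr, size_t size_class, uint32_t id)
{
	struct arg_pool *pool;

	assert(mgr != NULL && size_class < ARG_NUM_SIZE_CLASSES);
	pool = &mgr->pools[size_class];
	assert(pool->num_free < ARG_POOL_CAPACITY);
	pool->free_ids[pool->num_free++] = id;
}
```

Callers never create or destroy objects themselves; they only call
arg_get() and arg_put(), which keeps object reuse in one place.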
|
|
When allocating a QP we allocate an RQ and an SQ; the RQ is stored first
in memory, followed by the SQ.
This allocation is not physically contiguous - it may span across different
physical pages. SW Steering code always writes in pairs: 1BB write + 1BB read,
or 2 contiguous BBs of GTA WQE.
This led to an issue where the RQ allocation was 4x16, which is equal to
1 WQE BB, causing a 1 BB offset in the page and splitting the GTA WQE between
different physical pages.
The solution was to create the RQ with an even number of BBs and to have the
RQ aligned to a page.
Signed-off-by: Alex Vesker <[email protected]>
Signed-off-by: Yevgeny Kliteynik <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
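
The offset arithmetic behind this fix can be checked with a short
userspace sketch. It assumes a 4096-byte page, a 64-byte WQE BB
(consistent with "4x16 is equal to 1 WQE BB" above), and a QP buffer
that starts on a page boundary; spans_two_pages() and any_pair_split()
are hypothetical helpers, not driver functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_BYTES 4096u /* assumed page size */
#define BB_BYTES   64u   /* assumed WQE basic block size */

/* True if the byte range [offset, offset + len) crosses a page boundary. */
static bool spans_two_pages(size_t offset, size_t len)
{
	assert(len > 0);
	return offset / PAGE_BYTES != (offset + len - 1) / PAGE_BYTES;
}

/*
 * The SQ starts right after an RQ of rq_bytes and is written in pairs
 * of 2 BBs. Returns true if any such pair straddles two pages.
 */
static bool any_pair_split(size_t rq_bytes, size_t sq_bbs)
{
	for (size_t bb = 0; bb + 1 < sq_bbs; bb += 2) {
		size_t offset = rq_bytes + bb * BB_BYTES;

		if (spans_two_pages(offset, 2 * BB_BYTES))
			return true;
	}
	return false;
}
```

With a 64-byte RQ the pair starting at byte 4032 runs to byte 4159 and
crosses the 4096 boundary; with an RQ of an even number of BBs every pair
starts at a multiple of 128 bytes and stays within one page.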
|
|
Instead of using the write buffer for reading we will use a dedicated
buffer only for reading ICM memory.
Due to the new support for args, pending_wc can be an odd number,
and when reading into the same write buffer it is possible to
overwrite the next write on the same slot.
For example:
pending_wc is 17 so the buffer for write is:
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
and we have requests as follows:
r wr wr wr wr wr wr wr wr
Now, the first read will be written into the last write, because we use
the same buffer for read and write, before that write reaches the HW, and
we will have wrong data in the ICM area.
Signed-off-by: Erez Shitrit <[email protected]>
Signed-off-by: Yevgeny Kliteynik <[email protected]>
Reviewed-by: Alex Vesker <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
The accelerated modify header arguments are written to the HW area
with a special WQE and a specific data format.
A new function was added to support writing the new argument type.
Note that the GTA WQE is larger than READ and WRITE, so the queue
management logic was updated to support this.
Signed-off-by: Yevgeny Kliteynik <[email protected]>
Reviewed-by: Alex Vesker <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|