|
A conntrack entry can be inserted into the connection tracking table if there
is no existing entry with an identical tuple in either direction.
Example:
INITIATOR -> NAT/PAT -> RESPONDER
Initiator passes through NAT/PAT ("us") and SNAT is done (saddr rewrite).
Then, later, the NAT/PAT machine itself also wants to connect to RESPONDER.
This will not work if the SNAT done earlier has the same IP:PORT source pair.
Conntrack table has:
ORIGINAL: $IP_INITIATOR:$SPORT -> $IP_RESPONDER:$DPORT
REPLY: $IP_RESPONDER:$DPORT -> $IP_NAT:$SPORT
and new locally originating connection wants:
ORIGINAL: $IP_NAT:$SPORT -> $IP_RESPONDER:$DPORT
REPLY: $IP_RESPONDER:$DPORT -> $IP_NAT:$SPORT
This is handled by the NAT engine, which resolves the collision for the
locally originating connection by attempting a source port rewrite.
This is done even if this new connection attempt did not go through a
masquerade/snat rule.
There is a rare race condition with connection-less protocols like UDP,
where we do the port reallocation even though it's not needed.
This happens when new packets from the same, pre-existing flow are received
in both directions at the exact same time on different CPUs after the
conntrack table was flushed (or conntrack becomes active for the first time).
With strict ordering/a single CPU, the first packet creates a new ct entry and
the second packet is resolved as an established reply packet.
With parallel processing, both packets are picked up as new and both get
their own ct entry.
In this case, the 'reply' packet (picked up as ORIGINAL) can be mangled by
the NAT engine because a port collision is detected.
This change alone isn't enough to prevent a packet drop later during
nf_conntrack_confirm(): the existing clash resolution strategy will not
detect such a reverse clash. This is resolved by a followup patch.
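For illustration, a minimal user-space sketch of the tuple mirroring that
produces the collision above; the struct and helpers are made up and are not
the kernel's conntrack data structures:
```
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for a conntrack tuple: saddr:sport -> daddr:dport. */
struct tuple {
        uint32_t saddr, daddr;
        uint16_t sport, dport;
};

static bool tuple_equal(const struct tuple *a, const struct tuple *b)
{
        return a->saddr == b->saddr && a->daddr == b->daddr &&
               a->sport == b->sport && a->dport == b->dport;
}

/* The ORIGINAL tuple of the new, locally originated connection
 * ($IP_NAT:$SPORT -> $IP_RESPONDER:$DPORT) is exactly the mirror of the
 * REPLY tuple of the already-SNATed flow, so the insert would clash and
 * the NAT engine has to pick another source port. */
static bool clashes_with_reply(const struct tuple *new_orig,
                               const struct tuple *existing_reply)
{
        struct tuple mirrored = {
                .saddr = existing_reply->daddr, .sport = existing_reply->dport,
                .daddr = existing_reply->saddr, .dport = existing_reply->sport,
        };

        return tuple_equal(new_orig, &mirrored);
}
```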
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
Some packetdrill tests are flaky in debug mode. As discussed, increase
tolerance.
We have been doing this for debug builds outside ksft too.
The previous setting was 10000. Fifty manual runs in virtme-ng showed two
failures that needed 12000. To be on the safe side, increase to 14000.
Link: https://lore.kernel.org/netdev/Zuhhe4-MQHd3EkfN@mini-arch/
Fixes: 1e42f73fd3c2 ("selftests/net: packetdrill: import tcp/zerocopy")
Reported-by: Stanislav Fomichev <[email protected]>
Signed-off-by: Willem de Bruijn <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
Acked-by: Matthieu Baerts (NGI0) <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
|
|
The work can submit URBs and the URBs can schedule the work.
This cycle needs to be broken when a device is to be stopped.
Use a flag to do so.
This is a design issue as old as the driver.
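A hedged sketch of the pattern described above; the flag bit, structure and
helper names are illustrative, not the driver's actual identifiers:
```
#include <linux/bitops.h>
#include <linux/usb.h>
#include <linux/workqueue.h>

#define EXAMPLE_DEV_STOPPING    0       /* assumed flag bit */

struct example_dev {
        unsigned long flags;
        struct work_struct work;
        struct urb *urb;
};

static void example_work(struct work_struct *work)
{
        struct example_dev *dev = container_of(work, struct example_dev, work);

        if (test_bit(EXAMPLE_DEV_STOPPING, &dev->flags))
                return;         /* do not submit new URBs while stopping */
        /* ... usb_submit_urb(dev->urb, GFP_KERNEL) ... */
}

static void example_complete(struct urb *urb)
{
        struct example_dev *dev = urb->context;

        if (test_bit(EXAMPLE_DEV_STOPPING, &dev->flags))
                return;         /* do not reschedule the work while stopping */
        schedule_work(&dev->work);
}

static void example_stop(struct example_dev *dev)
{
        set_bit(EXAMPLE_DEV_STOPPING, &dev->flags);     /* break the cycle */
        cancel_work_sync(&dev->work);
        usb_kill_urb(dev->urb);
}
```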
Signed-off-by: Oliver Neukum <[email protected]>
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
CC: [email protected]
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
|
|
Commit 5fabb01207a2 ("net: stmmac: Add initial XDP support") sets the
PP_FLAG_DMA_SYNC_DEV flag for the page_pool unconditionally, so
page_pool_recycle_direct() will call page_pool_dma_sync_for_device()
on every page, even if the page is not going to be reused by an XDP program.
When XDP is not enabled, the page which holds the received buffer
will be recycled once the buffer is copied into a new SKB by
skb_copy_to_linear_data(), and the MAC core will never reuse this
page again. Always setting PP_FLAG_DMA_SYNC_DEV therefore wastes CPU
cycles on unnecessary calls to page_pool_dma_sync_for_device().
After this patch, a noticeable performance improvement of up to 9% was
observed on certain platforms.
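A hedged sketch of making the flag conditional, assuming it is only needed
when an XDP program can recycle pages back into the RX ring; the helper and
its parameters are illustrative:
```
#include <linux/dma-mapping.h>
#include <linux/numa.h>
#include <net/page_pool/types.h>

static struct page_pool *example_create_rx_pool(struct device *dev,
                                                unsigned int ring_size,
                                                unsigned int buf_size,
                                                bool xdp_enabled)
{
        struct page_pool_params pp_params = {
                .flags          = PP_FLAG_DMA_MAP,
                .pool_size      = ring_size,
                .nid            = NUMA_NO_NODE,
                .dev            = dev,
                .dma_dir        = DMA_FROM_DEVICE,
                .max_len        = buf_size,
        };

        /* Only ask page_pool to sync pages for the device when XDP may
         * recycle them back into the RX ring; otherwise skip the per-page
         * page_pool_dma_sync_for_device() cost. */
        if (xdp_enabled)
                pp_params.flags |= PP_FLAG_DMA_SYNC_DEV;

        return page_pool_create(&pp_params);
}
```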
Fixes: 5fabb01207a2 ("net: stmmac: Add initial XDP support")
Signed-off-by: Furong Xu <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
|
|
Currently, the virtio-net driver performs a pre-DMA-mapping for
small or mergeable RX buffers. But for small packets, a mismatched address
(without VIRTNET_RX_PAD and xdp_headroom) is used for unmapping.
That will result in unsynchronized buffers when SWIOTLB is enabled, for
example, when running as a TDX guest.
This patch unifies the address passed to the virtio core as the address of
the virtnet header and fixes the mismatched buffer address.
Changes from v2: unify the buf that is passed to the virtio core in small
and mergeable modes.
Changes from v1: Use ctx to get xdp_headroom.
Fixes: 295525e29a5b ("virtio_net: merge dma operations when filling mergeable buffers")
Signed-off-by: Wenbo Li <[email protected]>
Signed-off-by: Jiahui Cen <[email protected]>
Signed-off-by: Ying Fang <[email protected]>
Reviewed-by: Xuan Zhuo <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
|
|
Pull more block updates from Jens Axboe:
- Improve blk-integrity segment counting and merging (Keith)
- NVMe pull request via Keith:
- Multipath fixes (Hannes)
- Sysfs attribute list NULL terminate fix (Shin'ichiro)
- Remove problematic read-back (Keith)
- Fix for a regression with the IO scheduler switching freezing from
6.11 (Damien)
- Use a raw spinlock for sbitmap, as it may get called from preempt
disabled context (Ming)
- Cleanup for bd_claiming waiting, using var_waitqueue() rather than
the bit waitqueues, as that more accurately describes what it does
(Neil)
- Various cleanups (Kanchan, Qiu-ji, David)
* tag 'for-6.12/block-20240925' of git://git.kernel.dk/linux:
nvme: remove CC register read-back during enabling
nvme: null terminate nvme_tls_attrs
nvme-multipath: avoid hang on inaccessible namespaces
nvme-multipath: system fails to create generic nvme device
lib/sbitmap: define swap_lock as raw_spinlock_t
block: Remove unused blk_limits_io_{min,opt}
drbd: Fix atomicity violation in drbd_uuid_set_bm()
block: Fix elv_iosched_local_module handling of "none" scheduler
block: remove bogus union
block: change wait on bd_claiming to use a var_waitqueue
blk-integrity: improved sg segment mapping
block: unexport blk_rq_count_integrity_sg
nvme-rdma: use request to get integrity segments
scsi: use request to get integrity segments
block: provide a request helper for user integrity segments
blk-integrity: consider entire bio list for merging
blk-integrity: properly account for segments
blk-mq: set the nr_integrity_segments from bio
blk-mq: unconditional nr_integrity_segments
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"Some driver specific fixes that came in during the merge window.
Lorenzo Bianconi did some extra testing on the recently added airoha
driver and found some issues, Alexander Dahl fixed some issues with
signal delays in the Atmel QSPI driver and Jinjie Ruan has been fixing
some nits with runtime PM cleanup"
* tag 'spi-fix-v6.12-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: atmel-quadspi: Avoid overwriting delay register settings
spi: airoha: remove read cache in airoha_snand_dirmap_read()
spi: spi-fsl-lpspi: Undo runtime PM changes at driver exit time
spi: atmel-quadspi: Undo runtime PM changes at driver exit time
spi: airoha: fix airoha_snand_{write,read}_data data_len estimation
spi: airoha: fix dirmap_{read,write} operations
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux
Pull RTC updates from Alexandre Belloni:
"More conversions of DT bindings to yaml. There is one new driver, for
the DFRobot SD2405AL and support for important features of the stm32
RTC. Summary:
New driver:
- DFRobot SD2405AL
Drivers:
- stm32: add alarm A out and LSCO support
- sun6i: disable automatic clock input switching
- m48t59: set range"
* tag 'rtc-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux:
rtc: rc5t619: use proper module tables
rtc: m48t59: set range
dt-bindings: rtc: microcrystal,rv3028: add #clock-cells property
rtc: m48t59: Remove division condition with direct comparison
rtc: at91sam9: fix OF node leak in probe() error path
rtc: sun6i: disable automatic clock input switching
dt-bindings: rtc: Drop non-trivial duplicate compatibles
dt-bindings: vendor-prefixes: Add DFRobot.
dt-bindings: rtc: Add support for SD2405AL.
rtc: Add driver for SD2405AL
rtc: s35390a: Drop vendorless compatible string from match table
rtc: twl: convert comma to semicolon
dt-bindings: rtc: sprd,sc2731-rtc: convert to YAML
rtc: stm32: add alarm A out feature
rtc: stm32: add Low Speed Clock Output (LSCO) support
rtc: stm32: add pinctrl and pinmux interfaces
dt-bindings: rtc: stm32: describe pinmux nodes
|
|
This reverts commit 9184b17fbc23 ("dt-bindings: input: Goodix SPI HID
Touchscreen") because it duplicates an existing binding, leading to errors:
goodix,gt7986u.example.dtb:
touchscreen@0: compatible: 'oneOf' conditional failed, one must be fixed:
['goodix,gt7986u'] is too short
'goodix,gt7375p' was expected
This was reported on the mailing list on the 6th of September, but there
was no reaction from the contributor or maintainer to fix it.
Therefore let's drop the binding which breaks and duplicates the existing
one.
Fixes: 9184b17fbc23 ("dt-bindings: input: Goodix SPI HID Touchscreen")
Reported-by: Rob Herring <[email protected]>
Closes: https://lore.kernel.org/all/CAL_Jsq+QfTtRj_JCqXzktQ49H8VUnztVuaBjvvkg3fwEHniUHw@mail.gmail.com/
Signed-off-by: Krzysztof Kozlowski <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>
|
|
Drop support for Devicetree from the driver, because the binding is being
reverted (on the basis of duplicating an existing binding) and the property
was not added to the original binding.
Signed-off-by: Krzysztof Kozlowski <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>
|
|
The km.state is not checked in the driver's delayed work. When
xfrm_state_check_expire() is called, the state can be reset to
XFRM_STATE_EXPIRED, even if it is already XFRM_STATE_DEAD. This
happens when the xfrm state is deleted but not freed yet. As
__xfrm_state_delete() is then called again from the xfrm timer, the
following crash occurs.
To fix this issue, skip xfrm_state_check_expire() if km.state is not
XFRM_STATE_VALID.
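A hedged sketch of the guard described above; the surrounding function is
illustrative, only the km.state check reflects the fix:
```
#include <net/xfrm.h>

static void example_handle_limits(struct xfrm_state *x)
{
        /* A state that is already dead (deleted but not yet freed) must not
         * be pushed back to XFRM_STATE_EXPIRED, or the xfrm timer ends up
         * calling __xfrm_state_delete() on it a second time. */
        if (x->km.state != XFRM_STATE_VALID)
                return;

        xfrm_state_check_expire(x);
}
```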
Oops: general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#1] SMP
CPU: 5 UID: 0 PID: 7448 Comm: kworker/u102:2 Not tainted 6.11.0-rc2+ #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Workqueue: mlx5e_ipsec: eth%d mlx5e_ipsec_handle_sw_limits [mlx5_core]
RIP: 0010:__xfrm_state_delete+0x3d/0x1b0
Code: 0f 84 8b 01 00 00 48 89 fd c6 87 c8 00 00 00 05 48 8d bb 40 10 00 00 e8 11 04 1a 00 48 8b 95 b8 00 00 00 48 8b 85 c0 00 00 00 <48> 89 42 08 48 89 10 48 8b 55 10 48 b8 00 01 00 00 00 00 ad de 48
RSP: 0018:ffff88885f945ec8 EFLAGS: 00010246
RAX: dead000000000122 RBX: ffffffff82afa940 RCX: 0000000000000036
RDX: dead000000000100 RSI: 0000000000000000 RDI: ffffffff82afb980
RBP: ffff888109a20340 R08: ffff88885f945ea0 R09: 0000000000000000
R10: 0000000000000000 R11: ffff88885f945ff8 R12: 0000000000000246
R13: ffff888109a20340 R14: ffff88885f95f420 R15: ffff88885f95f400
FS: 0000000000000000(0000) GS:ffff88885f940000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f2163102430 CR3: 00000001128d6001 CR4: 0000000000370eb0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<IRQ>
? die_addr+0x33/0x90
? exc_general_protection+0x1a2/0x390
? asm_exc_general_protection+0x22/0x30
? __xfrm_state_delete+0x3d/0x1b0
? __xfrm_state_delete+0x2f/0x1b0
xfrm_timer_handler+0x174/0x350
? __xfrm_state_delete+0x1b0/0x1b0
__hrtimer_run_queues+0x121/0x270
hrtimer_run_softirq+0x88/0xd0
handle_softirqs+0xcc/0x270
do_softirq+0x3c/0x50
</IRQ>
<TASK>
__local_bh_enable_ip+0x47/0x50
mlx5e_ipsec_handle_sw_limits+0x7d/0x90 [mlx5_core]
process_one_work+0x137/0x2d0
worker_thread+0x28d/0x3a0
? rescuer_thread+0x480/0x480
kthread+0xb8/0xe0
? kthread_park+0x80/0x80
ret_from_fork+0x2d/0x50
? kthread_park+0x80/0x80
ret_from_fork_asm+0x11/0x20
</TASK>
Fixes: b2f7b01d36a9 ("net/mlx5e: Simulate missing IPsec TX limits hardware functionality")
Signed-off-by: Jianbo Liu <[email protected]>
Reviewed-by: Leon Romanovsky <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
When using larger RQ sizes and small MTU sizes, the hd_per_wq variable
can overflow, as in the following case:
$> ethtool --set-ring eth1 rx 8192
$> ip link set dev eth1 mtu 144
$> ethtool --features eth1 rx-gro-hw on
... yields in dmesg:
mlx5_core 0000:08:00.1: mlx5_cmd_out_err:808:(pid 194797): CREATE_MKEY(0x200) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x3bf6f), err(-22)
because hd_per_wq is 64K, which overflows to 0 and makes the command
fail.
This patch increases the variable size to 32 bits.
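A tiny user-space illustration of the wrap; only the 64K -> 0 truncation
matters, the surrounding computation is not reproduced here:
```
#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint32_t hd_per_wq = 65536;             /* value reached in the report */
        uint16_t as_u16 = (uint16_t)hd_per_wq;  /* old 16-bit field */

        /* prints "u16: 0, u32: 65536" - the wrapped 0 is what made
         * CREATE_MKEY fail with "bad parameter" */
        printf("u16: %u, u32: %u\n", as_u16, hd_per_wq);
        return 0;
}
```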
Fixes: 99be56171fa9 ("net/mlx5e: SHAMPO, Re-enable HW-GRO")
Signed-off-by: Dragos Tatulea <[email protected]>
Reviewed-by: Tariq Toukan <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
Fix all the 'E2BIG' returns in the error flows of functions to the
negative '-E2BIG', as negative error codes are used everywhere in the
HWS code.
This also fixes the following smatch warnings:
"warn: was negative '-E2BIG' intended?"
Fixes: 74a778b4a63f ("net/mlx5: HWS, added definers handling")
Reported-by: Dan Carpenter <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Yevgeny Kliteynik <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
When SQ creation fails, call the appropriate mlx5_core destroy function.
This fixes the following smatch warnings:
drivers/net/ethernet/mellanox/mlx5/core/steering/hws/mlx5hws_send.c:739
hws_send_ring_open_sq() warn: 'sq->dep_wqe' double freed
hws_send_ring_open_sq() warn: 'sq->wq_ctrl.buf.frags' double freed
hws_send_ring_open_sq() warn: 'sq->wr_priv' double freed
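A hedged sketch of the unwind pattern implied by the fix: use the matching
destroy helper for what the create call set up instead of freeing its
internals again. All names here are illustrative:
```
struct example_sq;

int  example_alloc_bufs(struct example_sq *sq);
void example_free_bufs(struct example_sq *sq);
int  example_core_create_sq(struct example_sq *sq);     /* core takes ownership */
void example_core_destroy_sq(struct example_sq *sq);    /* matching teardown */
int  example_alloc_wr_priv(struct example_sq *sq);

static int example_open_sq(struct example_sq *sq)
{
        int err;

        err = example_alloc_bufs(sq);
        if (err)
                return err;

        err = example_core_create_sq(sq);
        if (err)
                goto err_free_bufs;

        err = example_alloc_wr_priv(sq);
        if (err)
                goto err_destroy_sq;

        return 0;

err_destroy_sq:
        /* Tear down via the core helper; freeing the sq internals by hand
         * here as well is what produced the double-free warnings. */
        example_core_destroy_sq(sq);
err_free_bufs:
        example_free_bufs(sq);
        return err;
}
```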
Fixes: 2ca62599aa0b ("net/mlx5: HWS, added send engine and context handling")
Reported-by: Dan Carpenter <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Yevgeny Kliteynik <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
Fix the wrong size of a field in hca_cap_2.
The bug was introduced by adding new fields for HWS
without adjusting the size of the reserved field.
Fixes: 34c626c3004a ("net/mlx5: Added missing mlx5_ifc definition for HW Steering")
Signed-off-by: Yevgeny Kliteynik <[email protected]>
Reviewed-by: Tariq Toukan <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
In mlx5e_tir_builder_alloc(), kvzalloc() may return NULL,
which is then dereferenced on the next line when the
modify field is accessed.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
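A hedged sketch of the missing check; the builder layout is illustrative:
```
#include <linux/slab.h>
#include <linux/types.h>

struct example_builder {
        bool modify;
        /* ... */
};

static struct example_builder *example_builder_alloc(bool modify)
{
        struct example_builder *builder;

        builder = kvzalloc(sizeof(*builder), GFP_KERNEL);
        if (!builder)
                return NULL;    /* previously the next line ran unconditionally */

        builder->modify = modify;
        return builder;
}
```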
Fixes: a6696735d694 ("net/mlx5e: Convert TIR to a dedicated object")
Signed-off-by: Elena Salomatkina <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Reviewed-by: Kalesh AP <[email protected]>
Reviewed-by: Tariq Toukan <[email protected]>
Reviewed-by: Gal Pressman <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
Collecting crdump involves reading vsc registers from the PCI config space
of the mlx device, which can take a long time to complete. This might
result in starving other threads waiting to run on the CPU.
Numbers I got from testing ConnectX-5 Ex MCX516A-CDAT in the lab:
- mlx5_vsc_gw_read_block_fast() was called with length = 1310716.
- mlx5_vsc_gw_read_fast() reads 4 bytes at a time. It was not used to
read the entire 1310716 bytes. It was called 53813 times because
there are jumps in read_addr.
- On average mlx5_vsc_gw_read_fast() took 35284.4ns.
- In total mlx5_vsc_wait_on_flag() called vsc_read() 54707 times.
The average time for each call was 17548.3ns. In some instances
vsc_read() was called more than one time when the flag was not set.
As expected the thread released the cpu after 16 iterations in
mlx5_vsc_wait_on_flag().
- Total time to read crdump was 35284.4ns * 53813 ~= 1.898s.
It was seen in the field that crdump can take more than 5 seconds to
complete. During that time mlx5_vsc_wait_on_flag() did not release the
cpu because it did not complete 16 iterations. It is believed that pci
config reads were slow. Adding cond_resched() every 128 register reads
improves the situation. In the common case, crdump takes ~1.8989s and
the thread yields the cpu every ~4.51ms. If crdump takes ~5s, the thread
yields the cpu every ~18.0ms.
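A hedged sketch of the yielding cadence described above; the device and
read helper are illustrative:
```
#include <linux/sched.h>
#include <linux/types.h>

struct example_dev;
int example_vsc_read_reg(struct example_dev *dev, u32 addr, u32 *val);

static int example_read_block(struct example_dev *dev, u32 *out,
                              unsigned int nregs)
{
        unsigned int i;
        int err;

        for (i = 0; i < nregs; i++) {
                err = example_vsc_read_reg(dev, i, &out[i]);    /* ~35us each */
                if (err)
                        return err;

                /* Yield every 128 reads so a multi-second crdump collection
                 * does not starve other runnable threads. */
                if ((i + 1) % 128 == 0)
                        cond_resched();
        }

        return 0;
}
```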
Fixes: 8b9d8baae1de ("net/mlx5: Add Crdump support")
Reviewed-by: Yuanyuan Zhong <[email protected]>
Signed-off-by: Mohamed Khalfella <[email protected]>
Reviewed-by: Moshe Shemesh <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
Remove the erroneous unmap in case no DMA mapping was established.
The multi-packet WQE transmit code attempts to obtain a DMA mapping for
the skb. This could fail, e.g. under memory pressure, when the IOMMU
driver just can't allocate more memory for page tables. While the code
tries to handle this in the path below the err_unmap label, it erroneously
unmaps one entry from the sq's FIFO list of active mappings. Since the
current map attempt failed, this unmap removes some random DMA mapping
that might still be required. If the PCI function then presents that IOVA,
the IOMMU may assume a rogue DMA access and, e.g. on s390, put the PCI
function into an error state.
The erroneous behavior was seen in a stress-test environment that created
memory pressure.
Fixes: 5af75c747e2a ("net/mlx5e: Enhanced TX MPWQE for SKBs")
Signed-off-by: Gerd Bayer <[email protected]>
Reviewed-by: Zhu Yanjun <[email protected]>
Acked-by: Maxim Mikityanskiy <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock
Pull memblock updates from Mike Rapoport:
- new memblock_estimated_nr_free_pages() helper to replace
totalram_pages() which is less accurate when
CONFIG_DEFERRED_STRUCT_PAGE_INIT is set
- fixes for memblock tests
* tag 'memblock-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
s390/mm: get estimated free pages by memblock api
kernel/fork.c: get estimated free pages by memblock api
mm/memblock: introduce a new helper memblock_estimated_nr_free_pages()
memblock test: fix implicit declaration of function 'strscpy'
memblock test: fix implicit declaration of function 'isspace'
memblock test: fix implicit declaration of function 'memparse'
memblock test: add the definition of __setup()
memblock test: fix implicit declaration of function 'virt_to_phys'
tools/testing: abstract two init.h into common include directory
memblock tests: include export.h in linkage.h as kernel dose
memblock tests: include memory_hotplug.h in mmzone.h as kernel dose
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/alarsson/linux-sparc
Pull sparc32 update from Andreas Larsson:
- Remove an unused variable for sparc32
* tag 'sparc-for-6.12-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/alarsson/linux-sparc:
arch/sparc: remove unused varible paddrbase in function leon_swprobe()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix build error in vdso32 when building 64-bit with COMPAT=y and -Os
- Fix build error in pseries EEH when CONFIG_DEBUG_FS is not set
Thanks to Christophe Leroy, Narayana Murty N, Christian Zigotzky, and
Ritesh Harjani.
* tag 'powerpc-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/pseries/eeh: move pseries_eeh_err_inject() outside CONFIG_DEBUG_FS block
powerpc/vdso32: Fix use of crtsavres for PPC64
|
|
Pull clang-format updates from Miguel Ojeda:
"A routine update of the 'for_each' macro list"
* tag 'clang-format-6.12' of https://github.com/ojeda/linux:
clang-format: Update with v6.11-rc1's `for_each` macro list
|
|
Currently the Rust support is gated on not having MODVERSIONS enabled,
and as a result an "allmodconfig" build will disable Rust build tests.
While MODVERSIONS configurations are worth build testing, the feature is
not actually meaningful unless you run the result, and I'd rather get
build coverage of Rust than MODVERSIONS. So let's disable MODVERSIONS
for build testing until the Rust side clears up.
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Pull Rust updates from Miguel Ojeda:
"Toolchain and infrastructure:
- Support 'MITIGATION_{RETHUNK,RETPOLINE,SLS}' (which cleans up
objtool warnings), teach objtool about 'noreturn' Rust symbols and
mimic '___ADDRESSABLE()' for 'module_{init,exit}'. With that, we
should be objtool-warning-free, so enable it to run for all Rust
object files.
- KASAN (no 'SW_TAGS'), KCFI and shadow call sanitizer support.
- Support 'RUSTC_VERSION', including re-config and re-build on
change.
- Split helpers file into several files in a folder, to avoid
conflicts in it. Eventually those files will be moved to the right
places with the new build system. In addition, remove the need to
manually export the symbols defined there, reusing existing
machinery for that.
- Relax restriction on configurations with Rust + GCC plugins to just
the RANDSTRUCT plugin.
'kernel' crate:
- New 'list' module: doubly-linked linked list for use with reference
counted values, which is heavily used by the upcoming Rust Binder.
This includes 'ListArc' (a wrapper around 'Arc' that is guaranteed
unique for the given ID), 'AtomicTracker' (tracks whether a
'ListArc' exists using an atomic), 'ListLinks' (the prev/next
pointers for an item in a linked list), 'List' (the linked list
itself), 'Iter' (an iterator over a 'List'), 'Cursor' (a cursor
into a 'List' that allows removing elements), 'ListArcField' (a
field exclusively owned by a 'ListArc'), as well as support for
heterogeneous lists.
- New 'rbtree' module: red-black tree abstractions used by the
upcoming Rust Binder.
This includes 'RBTree' (the red-black tree itself), 'RBTreeNode' (a
node), 'RBTreeNodeReservation' (a memory reservation for a node),
'Iter' and 'IterMut' (immutable and mutable iterators), 'Cursor'
(bidirectional cursor that allows removing elements), as well as
an entry API similar to the Rust standard library one.
- 'init' module: add 'write_[pin_]init' methods and the
'InPlaceWrite' trait. Add the 'assert_pinned!' macro.
- 'sync' module: implement the 'InPlaceInit' trait for 'Arc' by
introducing an associated type in the trait.
- 'alloc' module: add 'drop_contents' method to 'BoxExt'.
- 'types' module: implement the 'ForeignOwnable' trait for
'Pin<Box<T>>' and improve the trait's documentation. In addition,
add the 'into_raw' method to the 'ARef' type.
- 'error' module: in preparation for the upcoming Rust support for
32-bit architectures, like arm, locally allow Clippy lint for
those.
Documentation:
- https://rust.docs.kernel.org has been announced, so link to it.
- Enable rustdoc's "jump to definition" feature, making its output a
bit closer to the experience in a cross-referencer.
- Debian Testing now also provides recent Rust releases (outside of
the freeze period), so add it to the list.
MAINTAINERS:
- Trevor is joining as reviewer of the "RUST" entry.
And a few other small bits"
* tag 'rust-6.12' of https://github.com/Rust-for-Linux/linux: (54 commits)
kasan: rust: Add KASAN smoke test via UAF
kbuild: rust: Enable KASAN support
rust: kasan: Rust does not support KHWASAN
kbuild: rust: Define probing macros for rustc
kasan: simplify and clarify Makefile
rust: cfi: add support for CFI_CLANG with Rust
cfi: add CONFIG_CFI_ICALL_NORMALIZE_INTEGERS
rust: support for shadow call stack sanitizer
docs: rust: include other expressions in conditional compilation section
kbuild: rust: replace proc macros dependency on `core.o` with the version text
kbuild: rust: rebuild if the version text changes
kbuild: rust: re-run Kconfig if the version text changes
kbuild: rust: add `CONFIG_RUSTC_VERSION`
rust: avoid `box_uninit_write` feature
MAINTAINERS: add Trevor Gross as Rust reviewer
rust: rbtree: add `RBTree::entry`
rust: rbtree: add cursor
rust: rbtree: add mutable iterator
rust: rbtree: add iterator
rust: rbtree: add red-black tree implementation backed by the C version
...
|
|
Add a test case for tracepoint events on modules. This checks if it can add
and remove the events correctly.
Link: https://lore.kernel.org/all/172397781494.286558.7581515061075998225.stgit@devnote2/
Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
|
|
Support raw tracepoint events on modules that will be loaded in the future
(i.e. currently unloaded modules).
This allows users to create raw tracepoint events which can be used from a
module's __init functions.
Note: since the kernel does not have any information about the tracepoints
in unloaded modules, fprobe events can neither check whether the tracepoint
exists nor extend the BTF-based arguments.
Link: https://lore.kernel.org/all/172397780593.286558.18360375226968537828.stgit@devnote2/
Suggested-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
|
|
Support raw tracepoint events on modules via fprobe events.
Since fprobe events only use for_each_kernel_tracepoint() to find a
tracepoint, tracepoints on modules are not handled. Thus if a user
specifies a tracepoint on a module, it shows an error.
This adds a new for_each_module_tracepoint() API to the tracepoint
subsystem and uses it to find tracepoints on modules.
Link: https://lore.kernel.org/all/172397779651.286558.15903703620679186867.stgit@devnote2/
Reported-by: don <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
|
|
Add a for_each_tracepoint_in_module() function to iterate over the
tracepoints in a module. This API is needed for handling tracepoints in
a loading module from the tracepoint_module_notifier callback function.
This also updates for_each_module_tracepoint() to pass the module to the
callback function so that it can find the module easily.
Link: https://lore.kernel.org/all/172397778740.286558.15781131277732977643.stgit@devnote2/
Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
|
|
Add for_each_module_tracepoint() for iterating over tracepoints
on modules. This is similar to for_each_kernel_tracepoint(), but covers
only the tracepoints on modules (not kernel built-in tracepoints).
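For context, a hedged sketch using the existing for_each_kernel_tracepoint()
callback shape, which (per the description above) the new module iterator
mirrors for module tracepoints; the exact for_each_module_tracepoint()
signature is not reproduced here:
```
#include <linux/printk.h>
#include <linux/string.h>
#include <linux/tracepoint.h>

/* Callback invoked once per built-in tracepoint. */
static void example_match_tp(struct tracepoint *tp, void *priv)
{
        const char *wanted = priv;

        if (!strcmp(tp->name, wanted))
                pr_info("found tracepoint %s\n", tp->name);
}

static void example_find(const char *name)
{
        /* Walks kernel built-in tracepoints only; for_each_module_tracepoint()
         * extends the same idea to tracepoints on modules. */
        for_each_kernel_tracepoint(example_match_tp, (void *)name);
}
```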
Link: https://lore.kernel.org/all/172397777800.286558.14554748203446214056.stgit@devnote2/
Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
|
|
init_test_probes() has been removed since
commit e44e81c5b90f ("kprobes: convert tests to kunit"), so the leftover
declaration is useless; remove it.
Link: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Gaosheng Cui <[email protected]>
Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
|
|
The trace_uprobe->nhit counter is not incremented atomically, so its value
is questionable when a uprobe is hit on multiple CPUs simultaneously.
Also, doing this shared counter increment across many CPUs causes heavy
cache line bouncing, limiting uprobe/uretprobe performance scaling with
the number of CPUs.
Solve both problems by making this a per-CPU counter.
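A hedged sketch of the per-CPU counter pattern; the struct and field names
are illustrative, not the tracer's actual layout:
```
#include <linux/cpumask.h>
#include <linux/errno.h>
#include <linux/percpu.h>

struct example_probe {
        unsigned long __percpu *nhit;
};

static int example_probe_init(struct example_probe *p)
{
        p->nhit = alloc_percpu(unsigned long);
        return p->nhit ? 0 : -ENOMEM;
}

static void example_probe_hit(struct example_probe *p)
{
        /* Local increment only: no cross-CPU cache line bouncing and no
         * lost updates from an unsynchronized read-modify-write. */
        this_cpu_inc(*p->nhit);
}

static unsigned long example_probe_read_nhit(struct example_probe *p)
{
        unsigned long sum = 0;
        int cpu;

        /* Readers (e.g. the stats file) sum the per-CPU values. */
        for_each_possible_cpu(cpu)
                sum += *per_cpu_ptr(p->nhit, cpu);

        return sum;
}
```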
Link: https://lore.kernel.org/all/[email protected]/
Reviewed-by: Oleg Nesterov <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
|
|
When the driver needs to send new packets to the device, it always
queues the new sk_buffs into an intermediate queue (send_pkt_queue)
and schedules a worker (send_pkt_work) to then queue them into the
virtqueue exposed to the device.
This increases the chance of batching, but also introduces a lot of
latency into the communication. So we can optimize this path by
adding a fast path to be taken when there is no element in the
intermediate queue, there is space available in the virtqueue,
and no other process is sending packets (i.e. holding the tx_lock).
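A hedged sketch of the fast-path decision; the structure, the space check
and the enqueue helper are illustrative simplifications of the driver's
logic:
```
#include <linux/mutex.h>
#include <linux/skbuff.h>
#include <linux/workqueue.h>

struct example_vsock {
        struct mutex tx_lock;
        struct sk_buff_head send_pkt_queue;     /* intermediate queue */
        bool tx_vq_has_space;                   /* stand-in for a virtqueue check */
        struct workqueue_struct *workqueue;
        struct work_struct send_pkt_work;
};

void example_add_to_virtqueue(struct example_vsock *vsock, struct sk_buff *skb);

static void example_send_pkt(struct example_vsock *vsock, struct sk_buff *skb)
{
        bool sent = false;

        if (mutex_trylock(&vsock->tx_lock)) {
                /* Fast path only when nothing is already queued (to preserve
                 * ordering) and the virtqueue has room. */
                if (skb_queue_empty(&vsock->send_pkt_queue) &&
                    vsock->tx_vq_has_space) {
                        example_add_to_virtqueue(vsock, skb);
                        sent = true;
                }
                mutex_unlock(&vsock->tx_lock);
        }

        if (!sent) {
                /* Slow path: defer to the worker, which batches packets. */
                skb_queue_tail(&vsock->send_pkt_queue, skb);
                queue_work(vsock->workqueue, &vsock->send_pkt_work);
        }
}
```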
The following benchmarks were run to check improvements in latency and
throughput. The test bed is a host with Intel i7-10700KF CPU @ 3.80GHz
and L1 guest running on QEMU/KVM with vhost process and all vCPUs
pinned individually to pCPUs.
- Latency
Tool: Fio version 3.37-56
Mode: pingpong (h-g-h)
Test runs: 50
Runtime-per-test: 50s
Type: SOCK_STREAM
In the following fio benchmark (pingpong mode) the host sends
a payload to the guest and waits for the same payload back.
The fio process is pinned both in the host and in the guest system.
Before: Linux 6.9.8
Payload 64B:
1st perc. overall 99th perc.
Before 12.91 16.78 42.24 us
After 9.77 13.57 39.17 us
Payload 512B:
1st perc. overall 99th perc.
Before 13.35 17.35 41.52 us
After 10.25 14.11 39.58 us
Payload 4K:
1st perc. overall 99th perc.
Before 14.71 19.87 41.52 us
After 10.51 14.96 40.81 us
- Throughput
Tool: iperf-vsock
The size represents the buffer length (-l) to read/write
P represents the number of parallel streams
P=1
4K 64K 128K
Before 6.87 29.3 29.5 Gb/s
After 10.5 39.4 39.9 Gb/s
P=2
4K 64K 128K
Before 10.5 32.8 33.2 Gb/s
After 17.8 47.7 48.5 Gb/s
P=4
4K 64K 128K
Before 12.7 33.6 34.2 Gb/s
After 16.9 48.1 50.5 Gb/s
The performance improvement is related to this optimization:
I used an eBPF kretprobe on virtio_transport_send_skb to check
that each packet was sent directly to the virtqueue.
Co-developed-by: Marco Pinna <[email protected]>
Signed-off-by: Marco Pinna <[email protected]>
Signed-off-by: Luigi Leonardi <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Reviewed-by: Stefano Garzarella <[email protected]>
|
|
Preliminary patch to introduce an optimization to the
enqueue system.
All the code used to enqueue a packet into the virtqueue
is removed from virtio_transport_send_pkt_work()
and moved to the new virtio_transport_send_skb() function.
Co-developed-by: Luigi Leonardi <[email protected]>
Signed-off-by: Luigi Leonardi <[email protected]>
Signed-off-by: Marco Pinna <[email protected]>
Reviewed-by: Stefano Garzarella <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
|
|
This 'struct kobj_type' is not modified. It is only used in
kobject_init_and_add() which takes a 'const struct kobj_type *ktype'
parameter.
Constifying this structure moves it to a read-only section, which can
increase overall security.
```
[Before]
text data bss dec hex filename
5974 1008 96 7078 1ba6 drivers/firmware/qemu_fw_cfg.o
[After]
text data bss dec hex filename
6038 944 96 7078 1ba6 drivers/firmware/qemu_fw_cfg.o
```
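A hedged sketch of the pattern; names are illustrative, only the const
qualifier reflects the change:
```
#include <linux/kobject.h>
#include <linux/slab.h>

struct example_entry {
        struct kobject kobj;
};

static void example_entry_release(struct kobject *kobj)
{
        kfree(container_of(kobj, struct example_entry, kobj));
}

/* const: the ktype is never modified, so it can live in a read-only
 * section (hence the text/data size shift shown above). */
static const struct kobj_type example_entry_ktype = {
        .release = example_entry_release,
};

static int example_entry_add(struct example_entry *entry,
                             struct kobject *parent, const char *name)
{
        /* kobject_init_and_add() already takes a const struct kobj_type *. */
        return kobject_init_and_add(&entry->kobj, &example_entry_ktype,
                                    parent, "%s", name);
}
```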
Signed-off-by: Hongbo Li <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
|
|
Currently, when a new MR is set up, the old MR is deleted. MR deletion
takes about 30-40% of the time of MR creation. As deleting the old MR is not
important for the process of setting up the new MR, this operation
can be postponed.
This series adds a workqueue that does MR garbage collection at a later
point. If the MR lock is taken, the handler will back off and
reschedule. The exception is during shutdown: then the handler must
not postpone the work.
Note that this is only a speculative optimization: if a mapping
operation is triggered while the garbage collector handler holds the
lock, that operation will have to wait for the handler to finish.
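A hedged sketch of the deferred-teardown pattern; structures, lock and the
teardown helper are illustrative:
```
#include <linux/jiffies.h>
#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/workqueue.h>

struct example_mr_resources {
        struct mutex lock;
        struct list_head gc_list;       /* old MRs waiting to be destroyed */
        struct delayed_work gc_dwork;
        struct workqueue_struct *wq;
};

void example_destroy_stale_mrs(struct list_head *gc_list);

static void example_mr_gc_handler(struct work_struct *work)
{
        struct example_mr_resources *mres =
                container_of(work, struct example_mr_resources, gc_dwork.work);

        /* Back off if a mapping operation currently holds the lock and try
         * again later (shutdown instead flushes the work synchronously). */
        if (!mutex_trylock(&mres->lock)) {
                queue_delayed_work(mres->wq, &mres->gc_dwork,
                                   msecs_to_jiffies(500));      /* assumed delay */
                return;
        }

        example_destroy_stale_mrs(&mres->gc_list);
        mutex_unlock(&mres->lock);
}
```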
Signed-off-by: Dragos Tatulea <[email protected]>
Reviewed-by: Cosmin Ratiu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
|
|
There's currently not a lot of action happening during
the init/destroy of MR resources. But more will be added
in the upcoming patches.
As the mr mutex lock init/destroy has been moved to these
new functions, the lifetime has now shifted away from
mlx5_vdpa_alloc_resources() / mlx5_vdpa_free_resources()
into these new functions. However, the lifetime at the
outer scope remains the same:
mlx5_vdpa_dev_add() / mlx5_vdpa_dev_free()
Signed-off-by: Dragos Tatulea <[email protected]>
Reviewed-by: Cosmin Ratiu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
|
|
Now that the mr resources have their own namespace in the
struct, give the lock a clearer name.
Signed-off-by: Dragos Tatulea <[email protected]>
Reviewed-by: Cosmin Ratiu <[email protected]>
Acked-by: Eugenio Pérez <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
|
|
Group all mapping related resources into their own structure.
Upcoming patches will add more members in this new structure.
Signed-off-by: Dragos Tatulea <[email protected]>
Reviewed-by: Cosmin Ratiu <[email protected]>
Acked-by: Eugenio Pérez <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
|
|
A followup patch will use this name for something else.
Signed-off-by: Dragos Tatulea <[email protected]>
Reviewed-by: Cosmin Ratiu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
|
|
Use the async interface to issue MTT MKEY deletion.
This makes destroy_user_mr() on average 8x times faster.
This number is also dependent on the size of the MR being
deleted.
Signed-off-by: Dragos Tatulea <[email protected]>
Reviewed-by: Cosmin Ratiu <[email protected]>
Acked-by: Eugenio Pérez <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
|
|
Use the async interface to issue MTT MKEY creation.
Extra care is taken at the allocation of FW input commands
due to the MTT tables having variable sizes depending on
MR.
The indirect MKEY is still created synchronously at the
end as the direct MKEYs need to be filled in.
This makes create_user_mr() 3-5x faster, depending on
the size of the MR.
Signed-off-by: Dragos Tatulea <[email protected]>
Reviewed-by: Cosmin Ratiu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
|
|
The virtio-vsock driver is already under VM SOCKETS (AF_VSOCK),
managed principally with the net tree, and VIRTIO AND VHOST
VSOCK DRIVER. However, changes that only affect the virtio part
usually go with Michael's tree, so let's also put the driver in
the VIRTIO CORE section to have its maintainers in CC for changes
to the virtio-vsock driver.
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Jason Wang <[email protected]>
Signed-off-by: Stefano Garzarella <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Reviewed-by: Stefan Hajnoczi <[email protected]>
Acked-by: Jason Wang <[email protected]>
|
|
Introduce sysfs entries to provide visibility into the multiple queues
used by the Virtio FS device. This enhancement allows users to query
information about these queues.
Specifically, add two sysfs entries:
1. Queue name: Provides the name of each queue (e.g. hiprio/requests.8).
2. CPU list: Shows the list of CPUs that can process requests for each
queue.
The CPU list feature is inspired by similar functionality in the block
MQ layer, which provides analogous sysfs entries for block devices.
These new sysfs entries will improve observability and aid in debugging
and performance tuning of Virtio FS devices.
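A hedged sketch of such read-only attributes, in the spirit of the block-MQ
cpu_list files; the queue structure and the way it is reached from the
kobject are illustrative:
```
#include <linux/cpumask.h>
#include <linux/kobject.h>
#include <linux/sysfs.h>

struct example_fs_queue {
        struct kobject kobj;
        const char *name;               /* e.g. "hiprio" or "requests.8" */
        cpumask_var_t cpus;             /* CPUs that may service this queue */
};

static ssize_t name_show(struct kobject *kobj, struct kobj_attribute *attr,
                         char *buf)
{
        struct example_fs_queue *q =
                container_of(kobj, struct example_fs_queue, kobj);

        return sysfs_emit(buf, "%s\n", q->name);
}

static ssize_t cpu_list_show(struct kobject *kobj, struct kobj_attribute *attr,
                             char *buf)
{
        struct example_fs_queue *q =
                container_of(kobj, struct example_fs_queue, kobj);

        /* "%*pbl" prints the mask as a CPU list, e.g. "0-3,8". */
        return sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(q->cpus));
}

static struct kobj_attribute name_attr = __ATTR_RO(name);
static struct kobj_attribute cpu_list_attr = __ATTR_RO(cpu_list);
```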
Reviewed-by: Idan Zach <[email protected]>
Reviewed-by: Shai Malin <[email protected]>
Signed-off-by: Max Gurtovoy <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
|
|
Introduce a new helper function virtio_fs_put_locked to encapsulate the
common pattern of releasing a virtio_fs reference while holding a lock.
The existing virtio_fs_put helper will be used to release a virtio_fs
reference while not holding a lock.
Also add an assertion in case the lock is not taken when it should be.
Reviewed-by: Idan Zach <[email protected]>
Reviewed-by: Shai Malin <[email protected]>
Signed-off-by: Max Gurtovoy <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Reviewed-by: Stefan Hajnoczi <[email protected]>
|
|
There are no callers and no implementation in the tree.
Signed-off-by: Yue Haibing <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Reviewed-by: Shannon Nelson <[email protected]>
Reviewed-by: Zhu Lingshan <[email protected]>
|
|
change_num_qps() is still suspending/resuming VQs one by one.
This change switches to parallel suspend/resume.
When increasing the number of queues the flow has changed a bit for
simplicity: the setup_vq() function will always be called before
resume_vqs(). If the VQ is initialized, setup_vq() will exit early. If
the VQ is not initialized, setup_vq() will create it and resume_vqs()
will resume it.
Signed-off-by: Dragos Tatulea <[email protected]>
Reviewed-by: Tariq Toukan <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Acked-by: Eugenio Pérez <[email protected]>
Tested-by: Lei Yang <[email protected]>
|
|
change_num_qps() has a lot of multiplications by 2 to convert
the number of VQ pairs to number of VQs. This patch simplifies
the code by doing the VQP -> VQ count conversion at the beginning
in a variable.
Signed-off-by: Dragos Tatulea <[email protected]>
Reviewed-by: Tariq Toukan <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Acked-by: Eugenio Pérez <[email protected]>
Tested-by: Lei Yang <[email protected]>
|
|
Unregistering notifiers is a costly operation. Instead of removing
the notifiers during device suspend and adding them back at resume,
simply ignore the call when the device is suspended.
At resume time call queue_link_work() to make sure that the device state
is propagated in case there were changes.
For 1 vDPA device x 32 VQs (16 VQPs) attached to a large VM (256 GB RAM,
32 CPUs x 2 threads per core), the device suspend time is reduced from
~13 ms to ~2.5 ms.
Signed-off-by: Dragos Tatulea <[email protected]>
Reviewed-by: Tariq Toukan <[email protected]>
Acked-by: Eugenio Pérez <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Tested-by: Lei Yang <[email protected]>
|
|
Currently device resume works on vqs serially. Building up on previous
changes that converted vq operations to the async api, this patch
parallelizes the device resume.
For 1 vDPA device x 32 VQs (16 VQPs) attached to a large VM (256 GB RAM,
32 CPUs x 2 threads per core), the device resume time is reduced from
~16 ms to ~4.5 ms.
Signed-off-by: Dragos Tatulea <[email protected]>
Reviewed-by: Tariq Toukan <[email protected]>
Acked-by: Eugenio Pérez <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Tested-by: Lei Yang <[email protected]>
|
|
Currently device suspend works on vqs serially. Building up on previous
changes that converted vq operations to the async api, this patch
parallelizes the device suspend:
1) Suspend all active vqs in parallel.
2) Query suspended vqs in parallel.
For 1 vDPA device x 32 VQs (16 VQPs) attached to a large VM (256 GB RAM,
32 CPUs x 2 threads per core), the device suspend time is reduced from
~37 ms to ~13 ms.
A later patch will remove the link unregister operation which will make
it even faster.
Signed-off-by: Dragos Tatulea <[email protected]>
Reviewed-by: Tariq Toukan <[email protected]>
Acked-by: Eugenio Pérez <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Tested-by: Lei Yang <[email protected]>
|