| Age | Commit message (Collapse) | Author | Files | Lines |
|
This commit adds support for enabling IP defrag using pre-existing
netfilter defrag support. Basically all the flag does is bump a refcnt
while the link the active. Checks are also added to ensure the prog
requesting defrag support is run _after_ netfilter defrag hooks.
We also take care to avoid any issues w.r.t. module unloading -- while
defrag is active on a link, the module is prevented from unloading.
Signed-off-by: Daniel Xu <[email protected]>
Reviewed-by: Florian Westphal <[email protected]>
Link: https://lore.kernel.org/r/5cff26f97e55161b7d56b09ddcf5f8888a5add1d.1689970773.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
accept_ra_min_rtr_lft only considered the lifetime of the default route
and discarded entire RAs accordingly.
This change renames accept_ra_min_rtr_lft to accept_ra_min_lft, and
applies the value to individual RA sections; in particular, router
lifetime, PIO preferred lifetime, and RIO lifetime. If any of those
lifetimes are lower than the configured value, the specific RA section
is ignored.
In order for the sysctl to be useful to Android, it should really apply
to all lifetimes in the RA, since that is what determines the minimum
frequency at which RAs must be processed by the kernel. Android uses
hardware offloads to drop RAs for a fraction of the minimum of all
lifetimes present in the RA (some networks have very frequent RAs (5s)
with high lifetimes (2h)). Despite this, we have encountered networks
that set the router lifetime to 30s which results in very frequent CPU
wakeups. Instead of disabling IPv6 (and dropping IPv6 ethertype in the
WiFi firmware) entirely on such networks, it seems better to ignore the
misconfigured routers while still processing RAs from other IPv6 routers
on the same network (i.e. to support IoT applications).
The previous implementation dropped the entire RA based on router
lifetime. This turned out to be hard to expand to the other lifetimes
present in the RA in a consistent manner; dropping the entire RA based
on RIO/PIO lifetimes would essentially require parsing the whole thing
twice.
Fixes: 1671bcfd76fd ("net: add sysctl accept_ra_min_rtr_lft")
Cc: Lorenzo Colitti <[email protected]>
Signed-off-by: Patrick Rohr <[email protected]>
Reviewed-by: Maciej Żenczykowski <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
- A fix for a performance problem in QubesOS, adding a way to drain the
queue of grants experiencing delayed unmaps faster
- A patch enabling the use of static event channels from user mode,
which was omitted when introducing supporting static event channels
- A fix for a problem where Xen related code didn't check properly for
running in a Xen environment, resulting in a WARN splat
* tag 'for-linus-6.5a-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: speed up grant-table reclaim
xen/evtchn: Introduce new IOCTL to bind static evtchn
xenbus: check xen_domain in xenbus_probe_initcall
|
|
Pull block fixes from Jens Axboe:
"A few fixes that should go into the current kernel release, mainly:
- Set of fixes for dasd (Stefan)
- Handle interruptible waits returning because of a signal for ublk
(Ming)"
* tag 'block-6.5-2023-07-28' of git://git.kernel.dk/linux:
ublk: return -EINTR if breaking from waiting for existed users in DEL_DEV
ublk: fail to recover device if queue setup is interrupted
ublk: fail to start device if queue setup is interrupted
block: Fix a source code comment in include/uapi/linux/blkzoned.h
s390/dasd: print copy pair message only for the correct error
s390/dasd: fix hanging device after request requeue
s390/dasd: use correct number of retries for ERP requests
s390/dasd: fix hanging device after quiesce/resume
|
|
Also add support to pass topdir to ynl-regen.sh (Jakub) and call
it from the makefile to update the UAPI headers.
Signed-off-by: Stanislav Fomichev <[email protected]>
Co-developed-by: Jakub Kicinski <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Now both the physical path and the emulated path can support an IO page
table replacement. Call iommufd_device_replace/iommufd_access_replace(),
when vdev->iommufd_attached is true.
Also update the VFIO_DEVICE_ATTACH_IOMMUFD_PT kdoc in the uAPI header.
Link: https://lore.kernel.org/r/b5f01956ff161f76aa52c95b0fa1ad6eaca95c4a.1690523699.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Alex Williamson <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
|
|
Typical misuse of
nla_parse_nested(array, XXX_MAX, ...);
array must be declared as
struct nlattr *array[XXX_MAX + 1];
v2: Based on feedbacks from Ido Schimmel and Zahari Doychev,
I also changed TCA_FLOWER_KEY_CFM_OPT_MAX and cfm_opt_policy
definitions.
syzbot reported:
BUG: KASAN: stack-out-of-bounds in __nla_validate_parse+0x136/0x2bd0 lib/nlattr.c:588
Write of size 32 at addr ffffc90003a0ee20 by task syz-executor296/5014
CPU: 0 PID: 5014 Comm: syz-executor296 Not tainted 6.5.0-rc2-syzkaller-00307-gd192f5382581 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/12/2023
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106
print_address_description mm/kasan/report.c:364 [inline]
print_report+0x163/0x540 mm/kasan/report.c:475
kasan_report+0x175/0x1b0 mm/kasan/report.c:588
kasan_check_range+0x27e/0x290 mm/kasan/generic.c:187
__asan_memset+0x23/0x40 mm/kasan/shadow.c:84
__nla_validate_parse+0x136/0x2bd0 lib/nlattr.c:588
__nla_parse+0x40/0x50 lib/nlattr.c:700
nla_parse_nested include/net/netlink.h:1262 [inline]
fl_set_key_cfm+0x1e3/0x440 net/sched/cls_flower.c:1718
fl_set_key+0x2168/0x6620 net/sched/cls_flower.c:1884
fl_tmplt_create+0x1fe/0x510 net/sched/cls_flower.c:2666
tc_chain_tmplt_add net/sched/cls_api.c:2959 [inline]
tc_ctl_chain+0x131d/0x1ac0 net/sched/cls_api.c:3068
rtnetlink_rcv_msg+0x82b/0xf50 net/core/rtnetlink.c:6424
netlink_rcv_skb+0x1df/0x430 net/netlink/af_netlink.c:2549
netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
netlink_unicast+0x7c3/0x990 net/netlink/af_netlink.c:1365
netlink_sendmsg+0xa2a/0xd60 net/netlink/af_netlink.c:1914
sock_sendmsg_nosec net/socket.c:725 [inline]
sock_sendmsg net/socket.c:748 [inline]
____sys_sendmsg+0x592/0x890 net/socket.c:2494
___sys_sendmsg net/socket.c:2548 [inline]
__sys_sendmsg+0x2b0/0x3a0 net/socket.c:2577
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f54c6150759
Code: 48 83 c4 28 c3 e8 d7 19 00 00 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffe06c30578 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f54c619902d RCX: 00007f54c6150759
RDX: 0000000000000000 RSI: 0000000020000280 RDI: 0000000000000003
RBP: 00007ffe06c30590 R08: 0000000000000000 R09: 00007ffe06c305f0
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f54c61c35f0
R13: 00007ffe06c30778 R14: 0000000000000001 R15: 0000000000000001
</TASK>
The buggy address belongs to stack of task syz-executor296/5014
and is located at offset 32 in frame:
fl_set_key_cfm+0x0/0x440 net/sched/cls_flower.c:374
This frame has 1 object:
[32, 56) 'nla_cfm_opt'
The buggy address belongs to the virtual mapping at
[ffffc90003a08000, ffffc90003a11000) created by:
copy_process+0x5c8/0x4290 kernel/fork.c:2330
Fixes: 7cfffd5fed3e ("net: flower: add support for matching cfm fields")
Reported-by: syzbot <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Simon Horman <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Reviewed-by: Zahari Doychev <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Add interpreter/jit support for new sign-extension load insns
which adds a new mode (BPF_MEMSX).
Also add verifier support to recognize these insns and to
do proper verification with new insns. In verifier, besides
to deduce proper bounds for the dst_reg, probed memory access
is also properly handled.
Acked-by: Eduard Zingerman <[email protected]>
Signed-off-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Cross-merge networking fixes after downstream PR.
No conflicts or adjacent changes.
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
This registers the new fchmodat2 syscall in most places as nuber 452,
with alpha being the exception where it's 562. I found all these sites
by grepping for fspick, which I assume has found me everything.
Signed-off-by: Palmer Dabbelt <[email protected]>
Signed-off-by: Alexey Gladkov <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Acked-by: Geert Uytterhoeven <[email protected]>
Message-Id: <a677d521f048e4ca439e7080a5328f21eb8e960e.1689092120.git.legion@kernel.org>
Signed-off-by: Christian Brauner <[email protected]>
|
|
This allows userspace to manually create HWPTs on IOAS's and then use
those HWPTs as inputs to iommufd_device_attach/replace().
Following series will extend this to allow creating iommu_domains with
driver specific parameters.
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Lu Baolu <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Tested-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
|
|
Older API PERF_MEM_LVL_UNC can be replaced by PERF_MEM_LVLNUM_UNC.
Signed-off-by: Ravi Bangoria <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Xen 4.17 supports the creation of static evtchns. To allow user space
application to bind static evtchns introduce new ioctl
"IOCTL_EVTCHN_BIND_STATIC". Existing IOCTL doing more than binding
that’s why we need to introduce the new IOCTL to only bind the static
event channels.
Static evtchns to be available for use during the lifetime of the
guest. When the application exits, __unbind_from_irq() ends up being
called from release() file operations because of that static evtchns
are getting closed. To avoid closing the static event channel, add the
new bool variable "is_static" in "struct irq_info" to mark the event
channel static when creating the event channel to avoid closing the
static evtchn.
Also, take this opportunity to remove the open-coded version of the
evtchn close in drivers/xen/evtchn.c file and use xen_evtchn_close().
Signed-off-by: Rahul Singh <[email protected]>
Reviewed-by: Oleksandr Tyshchenko <[email protected]>
Acked-by: Stefano Stabellini <[email protected]>
Link: https://lore.kernel.org/r/ae7329bf1713f83e4aad4f3fa0f316258c40a3e9.1689677042.git.rahul.singh@arm.com
Signed-off-by: Juergen Gross <[email protected]>
|
|
syzkaller found a warning in packet_getname() [0], where we try to
copy 16 bytes to sockaddr_ll.sll_addr[8].
Some devices (ip6gre, vti6, ip6tnl) have 16 bytes address expressed
by struct in6_addr. Also, Infiniband has 32 bytes as MAX_ADDR_LEN.
The write seems to overflow, but actually not since we use struct
sockaddr_storage defined in __sys_getsockname() and its size is 128
(_K_SS_MAXSIZE) bytes. Thus, we have sufficient room after sll_addr[]
as __data[].
To avoid the warning, let's add a flex array member union-ed with
sll_addr.
Another option would be to use strncpy() and limit the copied length
to sizeof(sll_addr), but it will return the partial address and break
an application that passes sockaddr_storage to getsockname().
[0]:
memcpy: detected field-spanning write (size 16) of single field "sll->sll_addr" at net/packet/af_packet.c:3604 (size 8)
WARNING: CPU: 0 PID: 255 at net/packet/af_packet.c:3604 packet_getname+0x25c/0x3a0 net/packet/af_packet.c:3604
Modules linked in:
CPU: 0 PID: 255 Comm: syz-executor750 Not tainted 6.5.0-rc1-00330-g60cc1f7d0605 #4
Hardware name: linux,dummy-virt (DT)
pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : packet_getname+0x25c/0x3a0 net/packet/af_packet.c:3604
lr : packet_getname+0x25c/0x3a0 net/packet/af_packet.c:3604
sp : ffff800089887bc0
x29: ffff800089887bc0 x28: ffff000010f80f80 x27: 0000000000000003
x26: dfff800000000000 x25: ffff700011310f80 x24: ffff800087d55000
x23: dfff800000000000 x22: ffff800089887c2c x21: 0000000000000010
x20: ffff00000de08310 x19: ffff800089887c20 x18: ffff800086ab1630
x17: 20646c6569662065 x16: 6c676e697320666f x15: 0000000000000001
x14: 1fffe0000d56d7ca x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000000 x10: 0000000000000000 x9 : 3e60944c3da92b00
x8 : 3e60944c3da92b00 x7 : 0000000000000001 x6 : 0000000000000001
x5 : ffff8000898874f8 x4 : ffff800086ac99e0 x3 : ffff8000803f8808
x2 : 0000000000000001 x1 : 0000000100000000 x0 : 0000000000000000
Call trace:
packet_getname+0x25c/0x3a0 net/packet/af_packet.c:3604
__sys_getsockname+0x168/0x24c net/socket.c:2042
__do_sys_getsockname net/socket.c:2057 [inline]
__se_sys_getsockname net/socket.c:2054 [inline]
__arm64_sys_getsockname+0x7c/0x94 net/socket.c:2054
__invoke_syscall arch/arm64/kernel/syscall.c:38 [inline]
invoke_syscall+0x98/0x2c0 arch/arm64/kernel/syscall.c:52
el0_svc_common+0x134/0x240 arch/arm64/kernel/syscall.c:139
do_el0_svc+0x64/0x198 arch/arm64/kernel/syscall.c:188
el0_svc+0x2c/0x7c arch/arm64/kernel/entry-common.c:647
el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:665
el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:591
Fixes: df8fc4e934c1 ("kbuild: Enable -fstrict-flex-arrays=3")
Reported-by: syzkaller <[email protected]>
Suggested-by: Kees Cook <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Simon Horman <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Currently the bpf_sk_assign helper in tc BPF context refuses SO_REUSEPORT
sockets. This means we can't use the helper to steer traffic to Envoy,
which configures SO_REUSEPORT on its sockets. In turn, we're blocked
from removing TPROXY from our setup.
The reason that bpf_sk_assign refuses such sockets is that the
bpf_sk_lookup helpers don't execute SK_REUSEPORT programs. Instead,
one of the reuseport sockets is selected by hash. This could cause
dispatch to the "wrong" socket:
sk = bpf_sk_lookup_tcp(...) // select SO_REUSEPORT by hash
bpf_sk_assign(skb, sk) // SK_REUSEPORT wasn't executed
Fixing this isn't as simple as invoking SK_REUSEPORT from the lookup
helpers unfortunately. In the tc context, L2 headers are at the start
of the skb, while SK_REUSEPORT expects L3 headers instead.
Instead, we execute the SK_REUSEPORT program when the assigned socket
is pulled out of the skb, further up the stack. This creates some
trickiness with regards to refcounting as bpf_sk_assign will put both
refcounted and RCU freed sockets in skb->sk. reuseport sockets are RCU
freed. We can infer that the sk_assigned socket is RCU freed if the
reuseport lookup succeeds, but convincing yourself of this fact isn't
straight forward. Therefore we defensively check refcounting on the
sk_assign sock even though it's probably not required in practice.
Fixes: 8e368dc72e86 ("bpf: Fix use of sk->sk_reuseport from sk_assign")
Fixes: cf7fbe660f2d ("bpf: Add socket assign support")
Co-developed-by: Daniel Borkmann <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Cc: Joe Stringer <[email protected]>
Link: https://lore.kernel.org/bpf/CACAyw98+qycmpQzKupquhkxbvWK4OFyDuuLMBNROnfWMZxUWeA@mail.gmail.com/
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Signed-off-by: Lorenz Bauer <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
|
|
This adds ioctl for userspace to attach device cdev fd to and detach
from IOAS/hw_pagetable managed by iommufd.
VFIO_DEVICE_ATTACH_IOMMUFD_PT: attach vfio device to IOAS or hw_pagetable
managed by iommufd. Attach can be undo
by VFIO_DEVICE_DETACH_IOMMUFD_PT or device
fd close.
VFIO_DEVICE_DETACH_IOMMUFD_PT: detach vfio device from the current attached
IOAS or hw_pagetable managed by iommufd.
Reviewed-by: Jason Gunthorpe <[email protected]>
Tested-by: Nicolin Chen <[email protected]>
Tested-by: Matthew Rosato <[email protected]>
Tested-by: Yanting Jiang <[email protected]>
Tested-by: Shameer Kolothum <[email protected]>
Tested-by: Terrence Xu <[email protected]>
Tested-by: Zhenzhong Duan <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alex Williamson <[email protected]>
|
|
This adds ioctl for userspace to bind device cdev fd to iommufd.
VFIO_DEVICE_BIND_IOMMUFD: bind device to an iommufd, hence gain DMA
control provided by the iommufd. open_device
op is called after bind_iommufd op.
Tested-by: Nicolin Chen <[email protected]>
Tested-by: Matthew Rosato <[email protected]>
Tested-by: Yanting Jiang <[email protected]>
Tested-by: Shameer Kolothum <[email protected]>
Tested-by: Terrence Xu <[email protected]>
Tested-by: Zhenzhong Duan <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alex Williamson <[email protected]>
|
|
This defines KVM_DEV_VFIO_FILE* and make alias with KVM_DEV_VFIO_GROUP*.
Old userspace uses KVM_DEV_VFIO_GROUP* works as well.
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Tested-by: Terrence Xu <[email protected]>
Tested-by: Nicolin Chen <[email protected]>
Tested-by: Matthew Rosato <[email protected]>
Tested-by: Yanting Jiang <[email protected]>
Tested-by: Shameer Kolothum <[email protected]>
Tested-by: Zhenzhong Duan <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alex Williamson <[email protected]>
|
|
This is the way user to invoke hot-reset for the devices opened by cdev
interface. User should check the flag VFIO_PCI_HOT_RESET_FLAG_DEV_ID_OWNED
in the output of VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl before doing
hot-reset for cdev devices.
Suggested-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Tested-by: Yanting Jiang <[email protected]>
Tested-by: Zhenzhong Duan <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alex Williamson <[email protected]>
|
|
This allows VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl use the iommufd_ctx
of the cdev device to check the ownership of the other affected devices.
When VFIO_DEVICE_GET_PCI_HOT_RESET_INFO is called on an IOMMUFD managed
device, the new flag VFIO_PCI_HOT_RESET_FLAG_DEV_ID is reported to indicate
the values returned are IOMMUFD devids rather than group IDs as used when
accessing vfio devices through the conventional vfio group interface.
Additionally the flag VFIO_PCI_HOT_RESET_FLAG_DEV_ID_OWNED will be reported
in this mode if all of the devices affected by the hot-reset are owned by
either virtue of being directly bound to the same iommufd context as the
calling device, or implicitly owned via a shared IOMMU group.
Suggested-by: Jason Gunthorpe <[email protected]>
Suggested-by: Alex Williamson <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Tested-by: Yanting Jiang <[email protected]>
Tested-by: Zhenzhong Duan <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alex Williamson <[email protected]>
|
|
Fix the symbolic names for zone conditions in the blkzoned.h header
file.
Cc: Hannes Reinecke <[email protected]>
Cc: Damien Le Moal <[email protected]>
Fixes: 6a0cb1bc106f ("block: Implement support for zoned block devices")
Signed-off-by: Bart Van Assche <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Backmerging to get v6.5-rc2.
Signed-off-by: Thomas Zimmermann <[email protected]>
|
|
This change adds a new sysctl accept_ra_min_rtr_lft to specify the
minimum acceptable router lifetime in an RA. If the received RA router
lifetime is less than the configured value (and not 0), the RA is
ignored.
This is useful for mobile devices, whose battery life can be impacted
by networks that configure RAs with a short lifetime. On such networks,
the device should never gain IPv6 provisioning and should attempt to
drop RAs via hardware offload, if available.
Signed-off-by: Patrick Rohr <[email protected]>
Cc: Maciej Żenczykowski <[email protected]>
Cc: Lorenzo Colitti <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
This patch adds the capability to filter messages sent by the proc
connector on the event type supplied in the message from the client
to the connector. The client can register to listen for an event type
given in struct proc_input.
This event based filteting will greatly enhance performance - handling
8K exits takes about 70ms, whereas 8K-forks + 8K-exits takes about 150ms
& handling 8K-forks + 8K-exits + 8K-execs takes 200ms. There are currently
9 different types of events, and we need to listen to all of them. Also,
measuring the time using pidfds for monitoring 8K process exits took
much longer - 200ms, as compared to 70ms using only exit notifications of
proc connector.
We also add a new event type - PROC_EVENT_NONZERO_EXIT, which is
only sent by kernel to a listening application when any process exiting,
has a non-zero exit status. This will help the clients like Oracle DB,
where a monitoring process wants notfications for non-zero process exits
so it can cleanup after them.
This kind of a new event could also be useful to other applications like
Google's lmkd daemon, which needs a killed process's exit notification.
The patch takes care that existing clients using old mechanism of not
sending the event type work without any changes.
cn_filter function checks to see if the event type being notified via
proc connector matches the event type requested by client, before
sending(matches) or dropping(does not match) a packet.
Signed-off-by: Anjali Kulkarni <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
The current proc connector code has the foll. bugs - if there are more
than one listeners for the proc connector messages, and one of them
deregisters for listening using PROC_CN_MCAST_IGNORE, they will still get
all proc connector messages, as long as there is another listener.
Another issue is if one client calls PROC_CN_MCAST_LISTEN, and another one
calls PROC_CN_MCAST_IGNORE, then both will end up not getting any messages.
This patch adds filtering and drops packet if client has sent
PROC_CN_MCAST_IGNORE. This data is stored in the client socket's
sk_user_data. In addition, we only increment or decrement
proc_event_num_listeners once per client. This fixes the above issues.
cn_release is the release function added for NETLINK_CONNECTOR. It uses
the newly added netlink_release function added to netlink_sock. It will
free sk_user_data.
Signed-off-by: Anjali Kulkarni <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Add driver notify uapi for application notifying
the driver about the doorbell FIFO congestion.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Chandramohan Akula <[email protected]>
Signed-off-by: Selvin Xavier <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
|
|
Update the alloc_page uapi functionality for handling the
mapping of doorbell pacing shared page and bar address.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Chandramohan Akula <[email protected]>
Signed-off-by: Selvin Xavier <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
|
|
Report the pacing capability to the user applications.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Chandramohan Akula <[email protected]>
Signed-off-by: Selvin Xavier <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
|
|
Cross-merge networking fixes after downstream PR.
No conflicts or adjacent changes.
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
This patch implements dlm plock F_SETLKW interruption feature. If a
blocking posix lock request got interrupted in user space by a signal a
cancellation request for a non granted lock request to the user space
lock manager will be send. The user lock manager answers either with
zero or a negative errno code. A errno of -ENOENT signals that there is
currently no blocking lock request waiting to being granted. In case of
-ENOENT it was probably to late to request a cancellation and the
pending lock got granted. In any error case we will wait until the lock
is being granted as cancellation failed, this causes also that in case
of an older user lock manager returning -EINVAL we will wait as
cancellation is not supported which should be fine. If a user requires
this feature the user should update dlm user space to support lock
request cancellation.
Signed-off-by: Alexander Aring <[email protected]>
Signed-off-by: David Teigland <[email protected]>
|
|
Add registers and definitions to support 1000BASE-T1. This includes the
PCS Control and Status registers (3.2304 and 3.2305) as well as some
missing bits on the PMA/PMD extended ability register (1.18) and PMA/PMD
CTRL (1.2100) register.
Signed-off-by: Stefan Eichenberger <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
|
|
After software has authenticated a dynamic boost control request,
it can fetch and set supported parameters using a selection of messages.
Add support for these messages and export the ability to do this to
userspace.
Acked-by: Tom Lendacky <[email protected]>
Signed-off-by: Mario Limonciello <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>
|
|
As part of the authentication flow for Dynamic Boost Control, the calling
software will need to send a uid used in all of its future
communications.
Add support for another IOCTL call to let userspace software set this up.
Acked-by: Tom Lendacky <[email protected]>
Signed-off-by: Mario Limonciello <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>
|
|
Dynamic Boost Control is a feature offered on AMD client platforms that
allows software to request and set power or frequency limits.
Only software that has authenticated with the PSP can retrieve or set
these limits.
Create a character device and ioctl for fetching the nonce. This ioctl
supports optionally passing authentication information which will influence
how many calls the nonce is valid for.
Acked-by: Tom Lendacky <[email protected]>
Signed-off-by: Mario Limonciello <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>
|
|
Introduce a new DRM_IOCTL_SYNCOBJ_EVENTFD IOCTL which signals an
eventfd from a syncobj.
This is useful for Wayland compositors to handle wait-before-submit.
Wayland clients can send a timeline point to the compositor
before the point has materialized yet, then compositors can wait
for the point to materialize via this new IOCTL.
The existing DRM_IOCTL_SYNCOBJ_TIMELINE_WAIT IOCTL is not suitable
because it blocks. Compositors want to integrate the wait with
their poll(2)-based event loop.
Requirements for new uAPI:
- User-space patch: https://gitlab.freedesktop.org/wlroots/wlroots/-/merge_requests/4262
- IGT: https://lists.freedesktop.org/archives/igt-dev/2023-July/057893.html
v2:
- Wait for fence when flags is zero
- Improve documentation (Pekka)
- Rename IOCTL (Christian)
- Fix typo in drm_syncobj_add_eventfd() (Christian)
v3:
- Link user-space + IGT patches
- Add reference from overview docs
v4: fix IOCTL number conflict with GETFB2 (Nicholas Choi, Vitaly Prosyak)
Signed-off-by: Simon Ser <[email protected]>
Reviewed-by: Christian König <[email protected]>
Acked-by: Pekka Paalanen <[email protected]>
Cc: Jason Ekstrand <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Bas Nieuwenhuizen <[email protected]>
Cc: Daniel Stone <[email protected]>
Cc: James Jones <[email protected]>
Cc: Austin Shafer <[email protected]>
Cc: Vitaly Prosyak <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:
====================
pull-request: bpf-next 2023-07-19
We've added 45 non-merge commits during the last 3 day(s) which contain
a total of 71 files changed, 7808 insertions(+), 592 deletions(-).
The main changes are:
1) multi-buffer support in AF_XDP, from Maciej Fijalkowski,
Magnus Karlsson, Tirthendu Sarkar.
2) BPF link support for tc BPF programs, from Daniel Borkmann.
3) Enable bpf_map_sum_elem_count kfunc for all program types,
from Anton Protopopov.
4) Add 'owner' field to bpf_rb_node to fix races in shared ownership,
Dave Marchevsky.
5) Prevent potential skb_header_pointer() misuse, from Alexei Starovoitov.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (45 commits)
bpf, net: Introduce skb_pointer_if_linear().
bpf: sync tools/ uapi header with
selftests/bpf: Add mprog API tests for BPF tcx links
selftests/bpf: Add mprog API tests for BPF tcx opts
bpftool: Extend net dump with tcx progs
libbpf: Add helper macro to clear opts structs
libbpf: Add link-based API for tcx
libbpf: Add opts-based attach/detach/query API for tcx
bpf: Add fd-based tcx multi-prog infra with link support
bpf: Add generic attach/detach/query API for multi-progs
selftests/xsk: reset NIC settings to default after running test suite
selftests/xsk: add test for too many frags
selftests/xsk: add metadata copy test for multi-buff
selftests/xsk: add invalid descriptor test for multi-buffer
selftests/xsk: add unaligned mode test for multi-buffer
selftests/xsk: add basic multi-buffer test
selftests/xsk: transmit and receive multi-buffer packets
xsk: add multi-buffer documentation
i40e: xsk: add TX multi-buffer support
ice: xsk: Tx multi-buffer support
...
====================
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
"Small but important fixes and a trivial cleanup"
* tag 'fuse-update-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: ioctl: translate ENOSYS in outarg
fuse: revalidate: don't invalidate if interrupted
fuse: Apply flags2 only when userspace set the FUSE_INIT_EXT
fuse: remove duplicate check for nodeid
fuse: add feature flag for expire-only
|
|
This work refactors and adds a lightweight extension ("tcx") to the tc BPF
ingress and egress data path side for allowing BPF program management based
on fds via bpf() syscall through the newly added generic multi-prog API.
The main goal behind this work which we also presented at LPC [0] last year
and a recent update at LSF/MM/BPF this year [3] is to support long-awaited
BPF link functionality for tc BPF programs, which allows for a model of safe
ownership and program detachment.
Given the rise in tc BPF users in cloud native environments, this becomes
necessary to avoid hard to debug incidents either through stale leftover
programs or 3rd party applications accidentally stepping on each others toes.
As a recap, a BPF link represents the attachment of a BPF program to a BPF
hook point. The BPF link holds a single reference to keep BPF program alive.
Moreover, hook points do not reference a BPF link, only the application's
fd or pinning does. A BPF link holds meta-data specific to attachment and
implements operations for link creation, (atomic) BPF program update,
detachment and introspection. The motivation for BPF links for tc BPF programs
is multi-fold, for example:
- From Meta: "It's especially important for applications that are deployed
fleet-wide and that don't "control" hosts they are deployed to. If such
application crashes and no one notices and does anything about that, BPF
program will keep running draining resources or even just, say, dropping
packets. We at FB had outages due to such permanent BPF attachment
semantics. With fd-based BPF link we are getting a framework, which allows
safe, auto-detachable behavior by default, unless application explicitly
opts in by pinning the BPF link." [1]
- From Cilium-side the tc BPF programs we attach to host-facing veth devices
and phys devices build the core datapath for Kubernetes Pods, and they
implement forwarding, load-balancing, policy, EDT-management, etc, within
BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently
experienced hard-to-debug issues in a user's staging environment where
another Kubernetes application using tc BPF attached to the same prio/handle
of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath
it. The goal is to establish a clear/safe ownership model via links which
cannot accidentally be overridden. [0,2]
BPF links for tc can co-exist with non-link attachments, and the semantics are
in line also with XDP links: BPF links cannot replace other BPF links, BPF
links cannot replace non-BPF links, non-BPF links cannot replace BPF links and
lastly only non-BPF links can replace non-BPF links. In case of Cilium, this
would solve mentioned issue of safe ownership model as 3rd party applications
would not be able to accidentally wipe Cilium programs, even if they are not
BPF link aware.
Earlier attempts [4] have tried to integrate BPF links into core tc machinery
to solve cls_bpf, which has been intrusive to the generic tc kernel API with
extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could
be wiped from the qdisc also. Locking a tc BPF program in place this way, is
getting into layering hacks given the two object models are vastly different.
We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF
attach API, so that the BPF link implementation blends in naturally similar to
other link types which are fd-based and without the need for changing core tc
internal APIs. BPF programs for tc can then be successively migrated from classic
cls_bpf to the new tc BPF link without needing to change the program's source
code, just the BPF loader mechanics for attaching is sufficient.
For the current tc framework, there is no change in behavior with this change
and neither does this change touch on tc core kernel APIs. The gist of this
patch is that the ingress and egress hook have a lightweight, qdisc-less
extension for BPF to attach its tc BPF programs, in other words, a minimal
entry point for tc BPF. The name tcx has been suggested from discussion of
earlier revisions of this work as a good fit, and to more easily differ between
the classic cls_bpf attachment and the fd-based one.
For the ingress and egress tcx points, the device holds a cache-friendly array
with program pointers which is separated from control plane (slow-path) data.
Earlier versions of this work used priority to determine ordering and expression
of dependencies similar as with classic tc, but it was challenged that for
something more future-proof a better user experience is required. Hence this
resulted in the design and development of the generic attach/detach/query API
for multi-progs. See prior patch with its discussion on the API design. tcx is
the first user and later we plan to integrate also others, for example, one
candidate is multi-prog support for XDP which would benefit and have the same
'look and feel' from API perspective.
The goal with tcx is to have maximum compatibility to existing tc BPF programs,
so they don't need to be rewritten specifically. Compatibility to call into
classic tcf_classify() is also provided in order to allow successive migration
or both to cleanly co-exist where needed given its all one logical tc layer and
the tcx plus classic tc cls/act build one logical overall processing pipeline.
tcx supports the simplified return codes TCX_NEXT which is non-terminating (go
to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT.
The fd-based API is behind a static key, so that when unused the code is also
not entered. The struct tcx_entry's program array is currently static, but
could be made dynamic if necessary at a point in future. The a/b pair swap
design has been chosen so that for detachment there are no allocations which
otherwise could fail.
The work has been tested with tc-testing selftest suite which all passes, as
well as the tc BPF tests from the BPF CI, and also with Cilium's L4LB.
Thanks also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews
of this work.
[0] https://lpc.events/event/16/contributions/1353/
[1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com
[2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog
[3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
[4] https://lore.kernel.org/bpf/[email protected]
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Jakub Kicinski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
This adds a generic layer called bpf_mprog which can be reused by different
attachment layers to enable multi-program attachment and dependency resolution.
In-kernel users of the bpf_mprog don't need to care about the dependency
resolution internals, they can just consume it with few API calls.
The initial idea of having a generic API sparked out of discussion [0] from an
earlier revision of this work where tc's priority was reused and exposed via
BPF uapi as a way to coordinate dependencies among tc BPF programs, similar
as-is for classic tc BPF. The feedback was that priority provides a bad user
experience and is hard to use [1], e.g.:
I cannot help but feel that priority logic copy-paste from old tc, netfilter
and friends is done because "that's how things were done in the past". [...]
Priority gets exposed everywhere in uapi all the way to bpftool when it's
right there for users to understand. And that's the main problem with it.
The user don't want to and don't need to be aware of it, but uapi forces them
to pick the priority. [...] Your cover letter [0] example proves that in
real life different service pick the same priority. They simply don't know
any better. Priority is an unnecessary magic that apps _have_ to pick, so
they just copy-paste and everyone ends up using the same.
The course of the discussion showed more and more the need for a generic,
reusable API where the "same look and feel" can be applied for various other
program types beyond just tc BPF, for example XDP today does not have multi-
program support in kernel, but also there was interest around this API for
improving management of cgroup program types. Such common multi-program
management concept is useful for BPF management daemons or user space BPF
applications coordinating internally about their attachments.
Both from Cilium and Meta side [2], we've collected the following requirements
for a generic attach/detach/query API for multi-progs which has been implemented
as part of this work:
- Support prog-based attach/detach and link API
- Dependency directives (can also be combined):
- BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none}
- BPF_F_ID flag as {fd,id} toggle; the rationale for id is so that user
space application does not need CAP_SYS_ADMIN to retrieve foreign fds
via bpf_*_get_fd_by_id()
- BPF_F_LINK flag as {prog,link} toggle
- If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and
BPF_F_AFTER will just append for attaching
- Enforced only at attach time
- BPF_F_REPLACE with replace_bpf_fd which can be prog, links have their
own infra for replacing their internal prog
- If no flags are set, then it's default append behavior for attaching
- Internal revision counter and optionally being able to pass expected_revision
- User space application can query current state with revision, and pass it
along for attachment to assert current state before doing updates
- Query also gets extension for link_ids array and link_attach_flags:
- prog_ids are always filled with program IDs
- link_ids are filled with link IDs when link was used, otherwise 0
- {prog,link}_attach_flags for holding {prog,link}-specific flags
- Must be easy to integrate/reuse for in-kernel users
The uapi-side changes needed for supporting bpf_mprog are rather minimal,
consisting of the additions of the attachment flags, revision counter, and
expanding existing union with relative_{fd,id} member.
The bpf_mprog framework consists of an bpf_mprog_entry object which holds
an array of bpf_mprog_fp (fast-path structure). The bpf_mprog_cp (control-path
structure) is part of bpf_mprog_bundle. Both have been separated, so that
fast-path gets efficient packing of bpf_prog pointers for maximum cache
efficiency. Also, array has been chosen instead of linked list or other
structures to remove unnecessary indirections for a fast point-to-entry in
tc for BPF.
The bpf_mprog_entry comes as a pair via bpf_mprog_bundle so that in case of
updates the peer bpf_mprog_entry is populated and then just swapped which
avoids additional allocations that could otherwise fail, for example, in
detach case. bpf_mprog_{fp,cp} arrays are currently static, but they could
be converted to dynamic allocation if necessary at a point in future.
Locking is deferred to the in-kernel user of bpf_mprog, for example, in case
of tcx which uses this API in the next patch, it piggybacks on rtnl.
An extensive test suite for checking all aspects of this API for prog-based
attach/detach and link API comes as BPF selftests in this series.
Thanks also to Andrii Nakryiko for early API discussions wrt Meta's BPF prog
management.
[0] https://lore.kernel.org/bpf/[email protected]
[1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@mail.gmail.com
[2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Introduce new netlink attribute NETDEV_A_DEV_XDP_ZC_MAX_SEGS that will
carry maximum fragments that underlying ZC driver is able to handle on
TX side. It is going to be included in netlink response only when driver
supports ZC. Any value higher than 1 implies multi-buffer ZC support on
underlying device.
Signed-off-by: Maciej Fijalkowski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
As of now xsk core drops any xdp_buff with data size greater than the
xsk frame_size as set by the af_xdp application. With multi-buffer
support introduced in the next patch xsk core can now split those
buffers into multiple descriptors provided the af_xdp application can
handle them. Such capability of the application needs to be independent
of the xdp_prog's frag support capability since there are cases where
even a single xdp_buffer may need to be split into multiple descriptors
owing to a smaller xsk frame size.
For e.g., with NIC rx_buffer size set to 4kB, a 3kB packet will
constitute of a single buffer and so will be sent as such to AF_XDP layer
irrespective of 'xdp.frags' capability of the XDP program. Now if the xsk
frame size is set to 2kB by the AF_XDP application, then the packet will
need to be split into 2 descriptors if AF_XDP application can handle
multi-buffer, else it needs to be dropped.
Applications can now advertise their frag handling capability to xsk core
so that xsk core can decide if it should drop or split xdp_buffs that
exceed xsk frame size. This is done using a new 'XSK_USE_SG' bind flag
for the xdp socket.
Signed-off-by: Tirthendu Sarkar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Use the 'options' field in xdp_desc as a packet continuity marker. Since
'options' field was unused till now and was expected to be set to 0, the
'eop' descriptor will have it set to 0, while the non-eop descriptors
will have to set it to 1. This ensures legacy applications continue to
work without needing any change for single-buffer packets.
Add helper functions and extend xskq_prod_reserve_desc() to use the
'options' field.
Signed-off-by: Tirthendu Sarkar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Add a new bridge port attribute that allows attaching a nexthop object
ID to an skb that is redirected to a backup bridge port with VLAN
tunneling enabled.
Specifically, when redirecting a known unicast packet, read the backup
nexthop ID from the bridge port that lost its carrier and set it in the
bridge control block of the skb before forwarding it via the backup
port. Note that reading the ID from the bridge port should not result in
a cache miss as the ID is added next to the 'backup_port' field that was
already accessed. After this change, the 'state' field still stays on
the first cache line, together with other data path related fields such
as 'flags and 'vlgrp':
struct net_bridge_port {
struct net_bridge * br; /* 0 8 */
struct net_device * dev; /* 8 8 */
netdevice_tracker dev_tracker; /* 16 0 */
struct list_head list; /* 16 16 */
long unsigned int flags; /* 32 8 */
struct net_bridge_vlan_group * vlgrp; /* 40 8 */
struct net_bridge_port * backup_port; /* 48 8 */
u32 backup_nhid; /* 56 4 */
u8 priority; /* 60 1 */
u8 state; /* 61 1 */
u16 port_no; /* 62 2 */
/* --- cacheline 1 boundary (64 bytes) --- */
[...]
} __attribute__((__aligned__(8)));
When forwarding an skb via a bridge port that has VLAN tunneling
enabled, check if the backup nexthop ID stored in the bridge control
block is valid (i.e., not zero). If so, instead of attaching the
pre-allocated metadata (that only has the tunnel key set), allocate a
new metadata, set both the tunnel key and the nexthop object ID and
attach it to the skb.
By default, do not dump the new attribute to user space as a value of
zero is an invalid nexthop object ID.
The above is useful for EVPN multihoming. When one of the links
composing an Ethernet Segment (ES) fails, traffic needs to be redirected
towards the host via one of the other ES peers. For example, if a host
is multihomed to three different VTEPs, the backup port of each ES link
needs to be set to the VXLAN device and the backup nexthop ID needs to
point to an FDB nexthop group that includes the IP addresses of the
other two VTEPs. The VXLAN driver will extract the ID from the metadata
of the redirected skb, calculate its flow hash and forward it towards
one of the other VTEPs. If the ID does not exist, or represents an
invalid nexthop object, the VXLAN driver will drop the skb. This
relieves the bridge driver from the need to validate the ID.
Signed-off-by: Ido Schimmel <[email protected]>
Acked-by: Nikolay Aleksandrov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
As described by Kumar in [0], in shared ownership scenarios it is
necessary to do runtime tracking of {rb,list} node ownership - and
synchronize updates using this ownership information - in order to
prevent races. This patch adds an 'owner' field to struct bpf_list_node
and bpf_rb_node to implement such runtime tracking.
The owner field is a void * that describes the ownership state of a
node. It can have the following values:
NULL - the node is not owned by any data structure
BPF_PTR_POISON - the node is in the process of being added to a data
structure
ptr_to_root - the pointee is a data structure 'root'
(bpf_rb_root / bpf_list_head) which owns this node
The field is initially NULL (set by bpf_obj_init_field default behavior)
and transitions states in the following sequence:
Insertion: NULL -> BPF_PTR_POISON -> ptr_to_root
Removal: ptr_to_root -> NULL
Before a node has been successfully inserted, it is not protected by any
root's lock, and therefore two programs can attempt to add the same node
to different roots simultaneously. For this reason the intermediate
BPF_PTR_POISON state is necessary. For removal, the node is protected
by some root's lock so this intermediate hop isn't necessary.
Note that bpf_list_pop_{front,back} helpers don't need to check owner
before removing as the node-to-be-removed is not passed in as input and
is instead taken directly from the list. Do the check anyways and
WARN_ON_ONCE in this unexpected scenario.
Selftest changes in this patch are entirely mechanical: some BTF
tests have hardcoded struct sizes for structs that contain
bpf_{list,rb}_node fields, those were adjusted to account for the new
sizes. Selftest additions to validate the owner field are added in a
further patch in the series.
[0]: https://lore.kernel.org/bpf/d7hyspcow5wtjcmw4fugdgyp3fwhljwuscp3xyut5qnwivyeru@ysdq543otzv2
Signed-off-by: Dave Marchevsky <[email protected]>
Suggested-by: Kumar Kartikeya Dwivedi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
seccomp_unotify allows more privileged processes do actions on behalf
of less privileged processes.
In many cases, the workflow is fully synchronous. It means a target
process triggers a system call and passes controls to a supervisor
process that handles the system call and returns controls to the target
process. In this context, "synchronous" means that only one process is
running and another one is waiting.
There is the WF_CURRENT_CPU flag that is used to advise the scheduler to
move the wakee to the current CPU. For such synchronous workflows, it
makes context switches a few times faster.
Right now, each interaction takes 12µs. With this patch, it takes about
3µs.
This change introduce the SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP flag that
it used to enable the sync mode.
Signed-off-by: Andrei Vagin <[email protected]>
Acked-by: "Peter Zijlstra (Intel)" <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>
|
|
Allow usage of IORING_ASYNC_CANCEL_OP through the sync cancelation
API as well.
Signed-off-by: Jens Axboe <[email protected]>
|
|
Add IORING_ASYNC_CANCEL_OP flag for cancelation, which allows the
application to target cancelation based on the opcode of the original
request.
Signed-off-by: Jens Axboe <[email protected]>
|
|
Add a flag to explicitly match on user_data in the request for
cancelation purposes. This is the default behavior if none of the
other match flags are set, but if we ALSO want to match on user_data,
then this flag can be set.
Signed-off-by: Jens Axboe <[email protected]>
|
|
git://anongit.freedesktop.org/drm/drm-misc into drm-next
drm-misc-next for v6.6:
UAPI Changes:
* fbdev:
* Make fbdev userspace interfaces optional; only leaves the
framebuffer console active
* prime:
* Support dma-buf self-import for all drivers automatically: improves
support for many userspace compositors
Cross-subsystem Changes:
* backlight:
* Fix interaction with fbdev in several drivers
* base: Convert struct platform.remove to return void; part of a larger,
tree-wide effort
* dma-buf: Acquire reservation lock for mmap() in exporters; part
of an on-going effort to simplify locking around dma-bufs
* fbdev:
* Use Linux device instead of fbdev device in many places
* Use deferred-I/O helper macros in various drivers
* i2c: Convert struct i2c from .probe_new to .probe; part of a larger,
tree-wide effort
* video:
* Avoid including <linux/screen_info.h>
Core Changes:
* atomic:
* Improve logging
* prime:
* Remove struct drm_driver.gem_prime_mmap plus driver updates: all
drivers now implement this callback with drm_gem_prime_mmap()
* gem:
* Support execution contexts: provides locking over multiple GEM
objects
* ttm:
* Support init_on_free
* Swapout fixes
Driver Changes:
* accel:
* ivpu: MMU updates; Support debugfs
* ast:
* Improve device-model detection
* Cleanups
* bridge:
* dw-hdmi: Improve support for YUV420 bus format
* dw-mipi-dsi: Fix enable/disable of DSI controller
* lt9611uxc: Use MODULE_FIRMWARE()
* ps8640: Remove broken EDID code
* samsung-dsim: Fix command transfer
* tc358764: Handle HS/VS polarity; Use BIT() macro; Various cleanups
* Cleanups
* ingenic:
* Kconfig REGMAP fixes
* loongson:
* Support display controller
* mgag200:
* Minor fixes
* mxsfb:
* Support disabling overlay planes
* nouveau:
* Improve VRAM detection
* Various fixes and cleanups
* panel:
* panel-edp: Support AUO B116XAB01.4
* Support Visionox R66451 plus DT bindings
* Cleanups
* ssd130x:
* Support per-controller default resolution plus DT bindings
* Reduce memory-allocation overhead
* Cleanups
* tidss:
* Support TI AM625 plus DT bindings
* Implement new connector model plus driver updates
* vkms
* Improve write-back support
* Documentation fixes
Signed-off-by: Daniel Vetter <[email protected]>
From: Thomas Zimmermann <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/20230713090830.GA23281@linux-uq9g
|
|
Add RPMSG_GET_OUTGOING_FLOWCONTROL and RPMSG_SET_INCOMING_FLOWCONTROL
IOCTL support for rpmsg char device nodes to get/set the low level
transport signals.
Signed-off-by: Chris Lew <[email protected]>
Signed-off-by: Deepak Kumar Singh <[email protected]>
Signed-off-by: Sarannya S <[email protected]>
Acked-by: Arnaud Pouliquen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Bjorn Andersson <[email protected]>
|