Age | Commit message | Author | Files | Lines |
|
Improve the readability of irqdomain debugging information in debugfs by
printing the flags field of domain files as human-readable strings instead
of a raw bitmask, aligning with the existing style used for irqchip
flags in the irq debug files.
Before:
# cat :cpus:cpu@0:interrupt-controller
name: :cpus:cpu@0:interrupt-controller
size: 0
mapped: 2
flags: 0x00000003
After:
# cat :cpus:cpu@0:interrupt-controller
name: :cpus:cpu@0:interrupt-controller
size: 0
mapped: 3
flags: 0x00000003
IRQ_DOMAIN_FLAG_HIERARCHY
IRQ_DOMAIN_NAME_ALLOCATED
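A minimal sketch of how such decoding can be done in the debugfs show
routine (the table layout and variable names are illustrative; only the
IRQ_DOMAIN_* flag names from include/linux/irqdomain.h are real):
  static const struct {
          unsigned int mask;
          const char *name;
  } irqdomain_flags[] = {
          { IRQ_DOMAIN_FLAG_HIERARCHY, "IRQ_DOMAIN_FLAG_HIERARCHY" },
          { IRQ_DOMAIN_NAME_ALLOCATED, "IRQ_DOMAIN_NAME_ALLOCATED" },
          /* ... remaining IRQ_DOMAIN_* flags ... */
  };

  seq_printf(m, "flags:   0x%08x\n", d->flags);
  for (i = 0; i < ARRAY_SIZE(irqdomain_flags); i++)
          if (d->flags & irqdomain_flags[i].mask)
                  seq_printf(m, "         %s\n", irqdomain_flags[i].name);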
Signed-off-by: Jinjie Ruan <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Interrupts which have no action and chained interrupts can be
ignored for the following reasons (as per tglx's comment):
1) Interrupts which have no action are completely uninteresting as
there is no real information attached.
2) Chained interrupts do not have a count at all.
So there is no point in evaluating the number of accounted interrupts before
checking for non-requested or chained interrupts.
Remove the any_count logic and simply check whether the interrupt
descriptor has the kstat_irqs member populated.
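A hedged sketch of the simplified check in show_interrupts(), per the
changelog (surrounding code and label elided):
  desc = irq_to_desc(i);
  if (!desc || !desc->kstat_irqs)
          goto out;       /* never requested, or chained: nothing to show */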
[ tglx: Adapted to upstream changes ]
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Adrian Huang <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Jiwei Sun <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Link: https://lore.kernel.org/lkml/87h6f0knau.ffs@tglx/
|
|
PPS (Pulse Per Second) generates a hardware pulse every second based on
CLOCK_REALTIME. This works fine when the pulse is generated in software
from a hrtimer callback function.
For hardware which generates the pulse by programming a timer it is
required to convert CLOCK_REALTIME to the underlying hardware clock.
The X86 Timed IO device is based on the Always Running Timer (ART), the
base clock of the TSC, which is usually the system clocksource on X86.
The core code already has functionality to convert base clock timestamps to
system clocksource timestamps, but there is no support for converting the
other way around.
Provide the required functionality to support such devices in a generic
way to avoid code duplication in drivers:
1) ktime_real_to_base_clock() to convert a CLOCK_REALTIME timestamp to a
base clock timestamp
2) timekeeping_clocksource_has_base() to allow drivers to validate that
the system clocksource is based on a particular clocksource ID.
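A hedged usage sketch for such a driver follows; CSID_X86_ART is the
clocksource ID from include/linux/clocksource_ids.h, while the function
name and error codes here are illustrative:
  #include <linux/timekeeping.h>

  static int program_timed_io_pulse(ktime_t treal_edge)
  {
          u64 art_cycles;

          if (!timekeeping_clocksource_has_base(CSID_X86_ART))
                  return -EOPNOTSUPP;     /* TSC/ART is not the system base */

          if (!ktime_real_to_base_clock(treal_edge, CSID_X86_ART, &art_cycles))
                  return -EAGAIN;         /* conversion not currently valid */

          /* program the hardware comparator with art_cycles here */
          return 0;
  }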
[ tglx: Simplify timekeeping_clocksource_has_base() and add missing READ_ONCE() ]
Co-developed-by: Thomas Gleixner <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Co-developed-by: Christopher S. Hall <[email protected]>
Signed-off-by: Christopher S. Hall <[email protected]>
Signed-off-by: Lakshmi Sowjanya D <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Hardware time stamps like provided by PTP clock implementations are based
on a clock which feeds both the PCIe device and the system clock. For
further processing the underlying hardware clock timestamp must be
converted to the system clock.
Right now this requires drivers to invoke an architecture specific
conversion function, e.g. to convert the ART (Always Running Timer)
timestamp to a TSC timestamp.
As the system clock is aware of the underlying base clock, this can be
moved to the core code by providing a base clock property for the system
clock which contains the conversion factors and assigning a clocksource ID
to the base clock.
Add the required data structures and the conversion infrastructure in the
core code to prepare for converting X86 and the related PTP drivers over.
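A rough sketch of the conversion direction being added; struct
clocksource_base and the field names (offset, numerator, denominator) are
my assumption of a ratio-based base clock description, not the verified
final API:
  /* cycles_sys = (cycles_base + offset) * numerator / denominator */
  static inline u64 base_to_system_cycles(u64 base_cycles,
                                          const struct clocksource_base *base)
  {
          return mul_u64_u32_div(base_cycles + base->offset,
                                 base->numerator, base->denominator);
  }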
[ tglx: Added a missing READ_ONCE(). Massaged change log ]
Co-developed-by: Thomas Gleixner <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Co-developed-by: Christopher S. Hall <[email protected]>
Signed-off-by: Christopher S. Hall <[email protected]>
Signed-off-by: Lakshmi Sowjanya D <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Fix the make W=1 warnings:
WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/time/clocksource-wdtest.o
WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/time/test_udelay.o
WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/time/time_test.o
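The usual one-line fix for each module, e.g. (the description string here
is illustrative):
  MODULE_DESCRIPTION("Clocksource watchdog test module");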
Signed-off-by: Jeff Johnson <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Paul E. McKenney <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
In cpuset_css_online(), clearing the CS_SCHED_LOAD_BALANCE bit
of cs->flags is guarded by callback_lock and cpuset_mutex. This is
not a problem in itself, as it is consistent with the description
of these two global locks at the beginning of this file. However, since
the operations of checking, setting and clearing the flag bit are atomic,
the protection of callback_lock is unnecessary here (see CS_SPREAD_*). So,
to make it more consistent with the other code, move the operation
outside the critical section of callback_lock, as sketched below.
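A sketch of the result; the surrounding condition is hypothetical, the
point being that the atomic clear_bit() needs no callback_lock:
  /* atomic bit op, safe without callback_lock (condition illustrative) */
  if (is_in_v2_mode() && !is_sched_load_balance(parent))
          clear_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);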
No functional changes intended.
Signed-off-by: Xiu Jianfeng <[email protected]>
Acked-by: Waiman Long <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Fix a Kconfig bug regarding comparisons to 'm' or 'n'
- Replace missed $(srctree)/$(src)
- Fix unneeded kallsyms step 3
- Remove incorrect "compatible" properties from image nodes in
image.fit
- Improve gen_kheaders.sh
- Fix 'make dt_binding_check'
- Clean up unnecessary code
* tag 'kbuild-fixes-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
dt-bindings: kbuild: Fix dt_binding_check on unconfigured build
kheaders: use `command -v` to test for existence of `cpio`
kheaders: explicitly define file modes for archived headers
scripts/make_fit: Drop fdt image entry compatible string
kbuild: remove a stale comment about cleaning in link-vmlinux.sh
kbuild: fix short log for AS in link-vmlinux.sh
kbuild: change scripts/mksysmap into sed script
kbuild: avoid unneeded kallsyms step 3
kbuild: scripts/gdb: Replace missed $(srctree)/$(src) w/ $(src)
kconfig: remove redundant check in expr_join_or()
kconfig: fix comparison to constant symbols, 'm', 'n'
kconfig: remove unused expr_is_no()
|
|
The bpf_session_cookie is unavailable for !CONFIG_FPROBE, as reported
by Sebastian [1].
To fix that, remove the CONFIG_FPROBE ifdef for the session kfuncs, which
is fine because there's a filter for session programs.
Then, based on the bpf_trace.o dependency:
obj-$(CONFIG_BPF_EVENTS) += bpf_trace.o
make the bpf_session_cookie BTF_ID entry in the special_kfunc_set list
depend on CONFIG_BPF_EVENTS.
[1] https://lore.kernel.org/bpf/[email protected]/T/#m71c6d5ec71db2967288cb79acedc15cc5dbfeec5
Reported-by: Sebastian Andrzej Siewior <[email protected]>
Tested-by: Sebastian Andrzej Siewior <[email protected]>
Suggested-by: Alexei Starovoitov <[email protected]>
Fixes: 5c919acef8514 ("bpf: Add support for kprobe session cookie")
Signed-off-by: Jiri Olsa <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Cross-merge networking fixes after downstream PR.
Conflicts:
drivers/net/ethernet/ti/icssg/icssg_classifier.c
abd5576b9c57 ("net: ti: icssg-prueth: Add support for ICSSG switch firmware")
56a5cf538c3f ("net: ti: icssg-prueth: Fix start counter for ft1 filter")
https://lore.kernel.org/all/[email protected]/
No other adjacent changes.
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:
- dma-mapping benchmark error handling fixes (Fedor Pchelkin)
- correct a config symbol reference in the DMA API documentation (Lukas
Bulwahn)
* tag 'dma-mapping-6.10-2024-05-31' of git://git.infradead.org/users/hch/dma-mapping:
Documentation/core-api: correct reference to SWIOTLB_DYNAMIC
dma-mapping: benchmark: handle NUMA_NO_NODE correctly
dma-mapping: benchmark: fix node id validation
dma-mapping: benchmark: avoid needless copy_to_user if benchmark fails
dma-mapping: benchmark: fix up kthread-related error handling
|
|
bpf_link_inc_not_zero() will be used by kernel modules. We will use it in
bpf_testmod.c later.
Signed-off-by: Kui-Feng Lee <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
|
|
Add epoll support to bpf struct_ops links to trigger EPOLLHUP event upon
detachment.
This patch implements the "poll" of the "struct file_operations" for BPF
links and introduces a new "poll" operator in the "struct bpf_link_ops". By
implementing "poll" of "struct bpf_link_ops" for the links of struct_ops,
the file descriptor of a struct_ops link can be added to an epoll file
descriptor to receive EPOLLHUP events.
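A user-space sketch of the intended usage, assuming link_fd refers to a
previously created struct_ops link:
  #include <sys/epoll.h>
  #include <stdio.h>

  static void wait_for_detach(int link_fd)
  {
          struct epoll_event ev = { .events = EPOLLHUP, .data.fd = link_fd };
          struct epoll_event out;
          int epfd = epoll_create1(0);

          epoll_ctl(epfd, EPOLL_CTL_ADD, link_fd, &ev);
          /* blocks until the struct_ops link is detached */
          if (epoll_wait(epfd, &out, 1, -1) == 1 && (out.events & EPOLLHUP))
                  printf("struct_ops link detached\n");
  }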
Signed-off-by: Kui-Feng Lee <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
|
|
Implement the detach callback in bpf_link_ops for struct_ops so that user
programs can detach a struct_ops link. The subsystems that struct_ops
objects are registered to can also use this callback to detach the links
being passed to them.
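From user space, the detach can be triggered through libbpf's existing
bpf_link_detach() wrapper for the BPF_LINK_DETACH command (sketch):
  if (bpf_link_detach(link_fd))
          fprintf(stderr, "failed to detach struct_ops link\n");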
Signed-off-by: Kui-Feng Lee <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
|
|
Pass an additional pointer of bpf_struct_ops_link to the reg, unreg, and
update callback functions provided by subsystems defined in bpf_struct_ops. A
bpf_struct_ops_map can be registered for multiple links. Passing a pointer
of bpf_struct_ops_link helps subsystems to distinguish them.
This pointer will be used in the later patches to let the subsystem
initiate a detachment on a link that was registered to it previously.
Signed-off-by: Kui-Feng Lee <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
|
|
Fix the 'make C=1' warning:
kernel/scftorture.c:71:6: warning: symbol 'torture_type' was not declared. Should it be static?
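The typical fix is to give the file-local definition internal linkage,
e.g. (initializer, if any, unchanged):
  static char *torture_type;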
Signed-off-by: Jeff Johnson <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
Fix the 'make W=1' warning:
WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/scftorture.o
Signed-off-by: Jeff Johnson <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
Fix the 'make W=1' warning:
WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/locking/locktorture.o
Signed-off-by: Jeff Johnson <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
Fix the 'make W=1' warning:
WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/torture.o
Signed-off-by: Jeff Johnson <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from bpf and netfilter.
Current release - regressions:
- gro: initialize network_offset in network layer
- tcp: reduce accepted window in NEW_SYN_RECV state
Current release - new code bugs:
- eth: mlx5e: do not use ptp structure for tx ts stats when not
initialized
- eth: ice: check for unregistering correct number of devlink params
Previous releases - regressions:
- bpf: Allow delete from sockmap/sockhash only if update is allowed
- sched: taprio: extend minimum interval restriction to entire cycle
too
- netfilter: ipset: add list flush to cancel_gc
- ipv4: fix address dump when IPv4 is disabled on an interface
- sock_map: avoid race between sock_map_close and sk_psock_put
- eth: mlx5: use mlx5_ipsec_rx_status_destroy to correctly delete
status rules
Previous releases - always broken:
- core: fix __dst_negative_advice() race
- bpf:
- fix multi-uprobe PID filtering logic
- fix pkt_type override upon netkit pass verdict
- netfilter: tproxy: bail out if IP has been disabled on the device
- af_unix: annotate data-race around unix_sk(sk)->addr
- eth: mlx5e: fix UDP GSO for encapsulated packets
- eth: idpf: don't enable NAPI and interrupts prior to allocating Rx
buffers
- eth: i40e: fully suspend and resume IO operations in EEH case
- eth: octeontx2-pf: free send queue buffers incase of leaf to inner
- eth: ipvlan: dont Use skb->sk in ipvlan_process_v{4,6}_outbound"
* tag 'net-6.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (69 commits)
netdev: add qstat for csum complete
ipvlan: Dont Use skb->sk in ipvlan_process_v{4,6}_outbound
net: ena: Fix redundant device NUMA node override
ice: check for unregistering correct number of devlink params
ice: fix 200G PHY types to link speed mapping
i40e: Fully suspend and resume IO operations in EEH case
i40e: factoring out i40e_suspend/i40e_resume
e1000e: move force SMBUS near the end of enable_ulp function
net: dsa: microchip: fix RGMII error in KSZ DSA driver
ipv4: correctly iterate over the target netns in inet_dump_ifaddr()
net: fix __dst_negative_advice() race
nfc/nci: Add the inconsistency check between the input data length and count
MAINTAINERS: dwmac: starfive: update Maintainer
net/sched: taprio: extend minimum interval restriction to entire cycle too
net/sched: taprio: make q->picos_per_byte available to fill_sched_entry()
netfilter: nft_fib: allow from forward/input without iif selector
netfilter: tproxy: bail out if IP has been disabled on the device
netfilter: nft_payload: skbuff vlan metadata mangle support
net: ti: icssg-prueth: Fix start counter for ft1 filter
sock_map: avoid race between sock_map_close and sk_psock_put
...
|
|
Add three new kfuncs for the bits iterator:
- bpf_iter_bits_new
Initialize a new bits iterator for a given memory area. Due to the
limitation of bpf memalloc, the max number of words (8-byte units) that
can be iterated over is limited to (4096 / 8).
- bpf_iter_bits_next
Get the next bit in a bpf_iter_bits
- bpf_iter_bits_destroy
Destroy a bpf_iter_bits
The bits iterator facilitates the iteration of the bits of a memory area,
such as cpumask. It can be used in any context and on any address.
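A hedged BPF-side sketch; bpf_for_each() is the open-coded iterator
convenience macro from the selftests' bpf_misc.h, and the program type is
arbitrary:
  /* requires vmlinux.h, bpf_helpers.h and the selftests' bpf_misc.h */
  SEC("syscall")
  int count_bits(void *ctx)
  {
          __u64 mask = 0xF0;      /* any readable memory works, e.g. a cpumask */
          int nr = 0, *bit;

          bpf_for_each(bits, bit, &mask, 1)   /* 1 = number of 8-byte words */
                  nr++;

          return nr;      /* 4 bits set in 0xF0 */
  }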
Signed-off-by: Yafang Shao <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Commit 13e1df09284d ("kheaders: explicitly validate existence of cpio
command") added an explicit check for `cpio` using `type`.
However, `type` in `dash` (which is used in some popular distributions
and base images as the shell script runner) prints the missing message
to standard output, and thus no error is printed:
$ bash -c 'type missing >/dev/null'
bash: line 1: type: missing: not found
$ dash -c 'type missing >/dev/null'
$
For instance, this issue may be seen by loongarch builders, given its
defconfig enables CONFIG_IKHEADERS since commit 9cc1df421f00 ("LoongArch:
Update Loongson-3 default config file").
Therefore, use `command -v` instead to have consistent behavior, and
take the chance to provide a more explicit error.
Fixes: 13e1df09284d ("kheaders: explicitly validate existence of cpio command")
Signed-off-by: Miguel Ojeda <[email protected]>
Signed-off-by: Masahiro Yamada <[email protected]>
|
|
Build environments might be running with different umask settings
resulting in non-deterministic file modes for the files contained in
kheaders.tar.xz. The file itself is served with 444, i.e. world
readable. Archive the files explicitly with 744,a+X to improve
reproducibility across build environments.
--mode=0444 is not suitable as directories need to be executable. Also,
444 makes it hard to delete all the readonly files after extraction.
Cc: [email protected]
Signed-off-by: Matthias Maennich <[email protected]>
Signed-off-by: Masahiro Yamada <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:
- uprobes: prevent mutex_lock() under rcu_read_lock().
Recent changes moved uprobe_cpu_buffer preparation, which involves
mutex_lock(), under __uprobe_trace_func(), which is called inside
rcu_read_lock().
Fix it by moving uprobe_cpu_buffer preparation outside of
__uprobe_trace_func()
- kprobe-events: handle the error case of btf_find_struct_member()
* tag 'probes-fixes-v6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/probes: fix error check in parse_btf_field()
uprobes: prevent mutex_lock() under rcu_read_lock()
|
|
PCI bridge window logic needs to find out, in advance of the actual
allocation, whether there is an empty space big enough to fit the window.
Export find_resource_space() for this purpose. Also move the struct
resource_constraint into generic header to be able to use the new
interface.
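A hedged sketch of the intended PCI-side use; the field and function names
follow this series, but the exact signature may differ:
  struct resource_constraint constraint = {
          .min    = 0,
          .max    = (resource_size_t)-1,
          .align  = SZ_1M,
          .alignf = NULL,
  };
  struct resource avail;

  /* bridge_res, size and do_the_real_allocation() are placeholders */
  if (find_resource_space(bridge_res, &avail, size, &constraint) == 0)
          do_the_real_allocation();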
Link: https://lore.kernel.org/r/[email protected]
Tested-by: Lidong Wang <[email protected]>
Signed-off-by: Ilpo Järvinen <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Mika Westerberg <[email protected]>
|
|
allocate_resource() accepts an ->alignf() callback to perform custom alignment
beyond constraint->align. If alignf is NULL, simple_align_resource() is
used, which only returns avail->start (no change).
Using avail->start directly is more natural and can be done with a conditional
in __find_resource_space() instead, which avoids an unnecessary
callback. It makes the code inside __find_resource_space() more obvious and
removes the need for callers to provide constraint->alignf
unnecessarily.
This is preparation for exporting find_resource_space().
Link: https://lore.kernel.org/r/[email protected]
Tested-by: Lidong Wang <[email protected]>
Signed-off-by: Ilpo Järvinen <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Mika Westerberg <[email protected]>
|
|
To make it simpler to declare resource constraint alignf callbacks, add a
typedef for it and document it.
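The typedef, as added by this series in include/linux/ioport.h:
  typedef resource_size_t (*resource_alignf)(void *data,
                                             const struct resource *res,
                                             resource_size_t size,
                                             resource_size_t align);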
Suggested-by: Andy Shevchenko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Tested-by: Lidong Wang <[email protected]>
Signed-off-by: Ilpo Järvinen <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Mika Westerberg <[email protected]>
|
|
Document find_resource_space() and the struct resource_constraint as they
are going to be exposed outside of resource.c.
Link: https://lore.kernel.org/r/[email protected]
Tested-by: Lidong Wang <[email protected]>
Signed-off-by: Ilpo Järvinen <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Mika Westerberg <[email protected]>
|
|
Rename find_resource() to find_resource_space() to better describe what the
function does. This is a preparation for exposing it beyond resource.c,
which is needed by PCI core. Also rename the __ variant to match the names.
Link: https://lore.kernel.org/r/[email protected]
Tested-by: Lidong Wang <[email protected]>
Signed-off-by: Ilpo Järvinen <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Mika Westerberg <[email protected]>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2024-05-28
We've added 23 non-merge commits during the last 11 day(s) which contain
a total of 45 files changed, 696 insertions(+), 277 deletions(-).
The main changes are:
1) Rename skb's mono_delivery_time to tstamp_type for extensibility
and add SKB_CLOCK_TAI type support to bpf_skb_set_tstamp(),
from Abhishek Chauhan.
2) Add netfilter CT zone ID and direction to bpf_ct_opts so that arbitrary
CT zones can be used from XDP/tc BPF netfilter CT helper functions,
from Brad Cowie.
3) Several tweaks to the instruction-set.rst IETF doc to address
the Last Call review comments, from Dave Thaler.
4) Small batch of riscv64 BPF JIT optimizations in order to emit more
compressed instructions to the JITed image for better icache efficiency,
from Xiao Wang.
5) Sort bpftool C dump output from BTF, aiming to simplify vmlinux.h
diffing and forcing more natural type definitions ordering,
from Mykyta Yatsenko.
6) Use DEV_STATS_INC() macro in BPF redirect helpers to silence
a syzbot/KCSAN race report for the tx_errors counter,
from Jiang Yunshui.
7) Un-constify bpf_func_info in bpftool to fix compilation with LLVM 17+
which started treating const structs as constants and thus breaking
full BTF program name resolution, from Ivan Babrou.
8) Fix up BPF program numbers in test_sockmap selftest in order to reduce
some of the test-internal array sizes, from Geliang Tang.
9) Small cleanup in Makefile.btf script to use test-ge check for v1.25-only
pahole, from Alan Maguire.
10) Fix bpftool's make dependencies for vmlinux.h in order to avoid needless
rebuilds in some corner cases, from Artem Savkov.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (23 commits)
bpf, net: Use DEV_STAT_INC()
bpf, docs: Fix instruction.rst indentation
bpf, docs: Clarify call local offset
bpf, docs: Add table captions
bpf, docs: clarify sign extension of 64-bit use of 32-bit imm
bpf, docs: Use RFC 2119 language for ISA requirements
bpf, docs: Move sentence about returning R0 to abi.rst
bpf: constify member bpf_sysctl_kern:: Table
riscv, bpf: Try RVC for reg move within BPF_CMPXCHG JIT
riscv, bpf: Use STACK_ALIGN macro for size rounding up
riscv, bpf: Optimize zextw insn with Zba extension
selftests/bpf: Handle forwarding of UDP CLOCK_TAI packets
net: Add additional bit to support clockid_t timestamp type
net: Rename mono_delivery_time to tstamp_type for scalabilty
selftests/bpf: Update tests for new ct zone opts for nf_conntrack kfuncs
net: netfilter: Make ct zone opts configurable for bpf ct helpers
selftests/bpf: Fix prog numbers in test_sockmap
bpf: Remove unused variable "prev_state"
bpftool: Un-const bpf_func_info to fix it for llvm 17 and newer
bpf: Fix order of args in call to bpf_map_kvcalloc
...
====================
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2024-05-27
We've added 15 non-merge commits during the last 7 day(s) which contain
a total of 18 files changed, 583 insertions(+), 55 deletions(-).
The main changes are:
1) Fix broken BPF multi-uprobe PID filtering logic which filtered by thread
while the promise was to filter by process, from Andrii Nakryiko.
2) Fix the recent influx of syzkaller reports to sockmap which triggered
a locking rule violation by performing a map_delete, from Jakub Sitnicki.
3) Fixes to netkit driver in particular on skb->pkt_type override upon pass
verdict, from Daniel Borkmann.
4) Fix an integer overflow in resolve_btfids which can wrongly trigger build
failures, from Friedrich Vock.
5) Follow-up fixes for ARC JIT reported by static analyzers,
from Shahab Vahedi.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Cover verifier checks for mutating sockmap/sockhash
Revert "bpf, sockmap: Prevent lock inversion deadlock in map delete elem"
bpf: Allow delete from sockmap/sockhash only if update is allowed
selftests/bpf: Add netkit test for pkt_type
selftests/bpf: Add netkit tests for mac address
netkit: Fix pkt_type override upon netkit pass verdict
netkit: Fix setting mac address in l2 mode
ARC, bpf: Fix issues reported by the static analyzers
selftests/bpf: extend multi-uprobe tests with USDTs
selftests/bpf: extend multi-uprobe tests with child thread case
libbpf: detect broken PID filtering logic for multi-uprobe
bpf: remove unnecessary rcu_read_{lock,unlock}() in multi-uprobe attach logic
bpf: fix multi-uprobe PID filtering logic
bpf: Fix potential integer overflow in resolve_btfids
MAINTAINERS: Add myself as reviewer of ARM64 BPF JIT
====================
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
We have seen an influx of syzkaller reports where a BPF program attached to
a tracepoint triggers a locking rule violation by performing a map_delete
on a sockmap/sockhash.
We don't intend to support this artificial use scenario. Extend the
existing verifier allowed-program-type check for updating sockmap/sockhash
to also cover deleting from a map.
From now on only BPF programs which were previously allowed to update
sockmap/sockhash can delete from these map types.
Fixes: ff9105993240 ("bpf, sockmap: Prevent lock inversion deadlock in map delete elem")
Reported-by: Tetsuo Handa <[email protected]>
Reported-by: [email protected]
Signed-off-by: Jakub Sitnicki <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Tested-by: [email protected]
Acked-by: John Fastabend <[email protected]>
Closes: https://syzkaller.appspot.com/bug?extid=ec941d6e24f633a59172
Link: https://lore.kernel.org/bpf/[email protected]
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Fix io_uring based write-through after converting cifs to use the
netfs library
- Fix aio error handling when doing write-through via netfs library
- Fix performance regression in iomap when used with non-large folio
mappings
- Fix signalfd error code
- Remove obsolete comment in signalfd code
- Fix async request indication in netfs_perform_write() by raising
BDP_ASYNC when IOCB_NOWAIT is set
- Yield swap device immediately to prevent spurious EBUSY errors
- Don't cross a .backup mountpoint from backup volumes in afs to avoid
infinite loops
- Fix a race between umount and async request completion in 9p after 9p
was converted to use the netfs library
* tag 'vfs-6.10-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
netfs, 9p: Fix race between umount and async request completion
afs: Don't cross .backup mountpoint from backup volume
swap: yield device immediately
netfs: Fix setting of BDP_ASYNC from iocb flags
signalfd: drop an obsolete comment
signalfd: fix error return code
iomap: fault in smaller chunks for non-large folio mappings
filemap: add helper mapping_max_folio_size()
netfs: Fix AIO error handling when doing write-through
netfs: Fix io_uring based write-through
|
|
Do a spell-checking pass.
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
btf_find_struct_member() might return NULL or an error via the
ERR_PTR() macro. However, its caller in parse_btf_field() only checks
for the NULL condition. Fix this by using IS_ERR() and returning the
error up the stack.
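A sketch of the corrected pattern (identifiers abbreviated from
parse_btf_field(); error codes illustrative):
  field = btf_find_struct_member(btf, type, fieldname, &anon_offs);
  if (IS_ERR(field))
          return PTR_ERR(field);  /* propagate the error up the stack */
  if (!field)
          return -ENOENT;         /* the pre-existing NULL handling */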
Link: https://lore.kernel.org/all/[email protected]/
Fixes: c440adfbe3025 ("tracing/probes: Support BTF based data structure field access")
Signed-off-by: Carlos López <[email protected]>
Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
|
|
core.c has become rather large, move most scheduler syscall
related functionality into a separate file, syscalls.c.
This is about ~15% of core.c's raw linecount.
Move the alloc_user_cpus_ptr(), __rt_effective_prio(),
rt_effective_prio(), uclamp_none(), uclamp_se_set()
and uclamp_bucket_id() inlines to kernel/sched/sched.h.
Internally export the __sched_setscheduler(), __sched_setaffinity(),
__setscheduler_prio(), set_load_weight(), enqueue_task(), dequeue_task(),
check_class_changed(), splice_balance_callbacks() and balance_callbacks()
methods to better facilitate this.
Move the new file's build to build_policy.c, because it fits there
semantically, but also because it's the smallest of the 4 build units
under an allmodconfig build:
-rw-rw-r-- 1 mingo mingo 7.3M May 27 12:35 kernel/sched/core.i
-rw-rw-r-- 1 mingo mingo 6.4M May 27 12:36 kernel/sched/build_utility.i
-rw-rw-r-- 1 mingo mingo 6.3M May 27 12:36 kernel/sched/fair.i
-rw-rw-r-- 1 mingo mingo 5.8M May 27 12:36 kernel/sched/build_policy.i
This better balances build time for scheduler subsystem rebuilds.
I build-tested this new file as a standalone syscalls.o file for a bit,
to make sure all the encapsulations & abstractions are robust.
Also update/add my copyright notices to these files.
Build time measurements:
# -Before/+After:
kepler:~/tip> perf stat -e 'cycles,instructions,duration_time' --sync --repeat 5 --pre 'rm -f kernel/sched/*.o' m kernel/sched/built-in.a >/dev/null
Performance counter stats for 'm kernel/sched/built-in.a' (5 runs):
- 71,938,508,607 cycles ( +- 0.17% )
+ 71,992,916,493 cycles ( +- 0.22% )
- 106,214,780,964 instructions # 1.48 insn per cycle ( +- 0.01% )
+ 105,450,231,154 instructions # 1.46 insn per cycle ( +- 0.01% )
- 5,878,232,620 ns duration_time ( +- 0.38% )
+ 5,290,085,069 ns duration_time ( +- 0.21% )
- 5.8782 +- 0.0221 seconds time elapsed ( +- 0.38% )
+ 5.2901 +- 0.0111 seconds time elapsed ( +- 0.21% )
Build time improvement of -11.1% (duration_time) is expected: the
parallel build time of the scheduler subsystem is determined by the
largest, slowest to build object file, which is kernel/sched/core.o.
By moving ~15% of its complexity into another build unit, we reduced
build time by -11%.
Measured cycles spent on building is within its ~0.2% stddev noise envelope.
The -0.7% reduction in instructions spent on building the scheduler is
statistically reliable and somewhat surprising - I can only speculate:
maybe compilers aren't that efficient at building & optimizing 10+ KLOC files
(core.c), and it's an overall win to balance the linecount a bit.
Anyway, this might be a data point that suggests that reducing the linecount
of our largest files will improve not just code readability and maintainability,
but might also improve build times a bit.
Code generation got a bit worse, by 0.5kb text on an x86 defconfig build:
# -Before/+After:
kepler:~/tip> size vmlinux
text data bss dec hex filename
-26475475 10439178 1740804 38655457 24dd5e1 vmlinux
+26476003 10439178 1740804 38655985 24dd7f1 vmlinux
kepler:~/tip> size kernel/sched/built-in.a
text data bss dec hex filename
- 76056 30025 489 106570 1a04a kernel/sched/core.o (ex kernel/sched/built-in.a)
+ 63452 29453 489 93394 16cd2 kernel/sched/core.o (ex kernel/sched/built-in.a)
44299 2181 104 46584 b5f8 kernel/sched/fair.o (ex kernel/sched/built-in.a)
- 42764 3424 120 46308 b4e4 kernel/sched/build_policy.o (ex kernel/sched/built-in.a)
+ 55651 4044 120 59815 e9a7 kernel/sched/build_policy.o (ex kernel/sched/built-in.a)
44866 12655 2192 59713 e941 kernel/sched/build_utility.o (ex kernel/sched/built-in.a)
This is primarily due to the extra functions exported, and the size
gets exaggerated somewhat by __pfx CFI function padding:
ffffffff810cc710 <__pfx_enqueue_task>:
ffffffff810cc710: 90 nop
ffffffff810cc711: 90 nop
ffffffff810cc712: 90 nop
ffffffff810cc713: 90 nop
ffffffff810cc714: 90 nop
ffffffff810cc715: 90 nop
ffffffff810cc716: 90 nop
ffffffff810cc717: 90 nop
ffffffff810cc718: 90 nop
ffffffff810cc719: 90 nop
ffffffff810cc71a: 90 nop
ffffffff810cc71b: 90 nop
ffffffff810cc71c: 90 nop
ffffffff810cc71d: 90 nop
ffffffff810cc71e: 90 nop
ffffffff810cc71f: 90 nop
AFAICS the cost is primarily not to core.o and fair.o though (which contain
the most performance-sensitive scheduler functions), only to syscalls.o,
which gets called with much lower frequency - so I think this is an acceptable
trade-off for better code separation.
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mel Gorman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Even though css_clear_dir() would be called to clean up
all existing cgroup files when css_populate_dir() fails,
reclaiming newly created cgroup files before
css_populate_dir() returns with failure makes the code more
consistent.
Signed-off-by: David Wang <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
Hierarchical counting of events is not practical for watching when a
particular pids.max is being hit. Therefore introduce .local flavor of
events file (akin to memory controller) that collects only events
relevant to given cgroup.
The file is only added to the default hierarchy.
Signed-off-by: Michal Koutný <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
The pids.events file should honor the hierarchy, so make the events
propagate from their origin up to the root on the unified hierarchy. The
legacy behavior remains non-hierarchical.
Signed-off-by: Michal Koutný <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
Currently, when the pids.max limit is breached in the hierarchy, the event
is counted and reported in the cgroup where the forking task resides.
This decouples the limit and the notification caused by the limit, making
it hard to detect when the actual limit was hit.
Redefine the pids.events:max as: the number of times the limit of the
cgroup was hit.
(Implementation differentiates also "forkfail" event but this is
currently not exposed as it would better fit into pids.stat. It also
differs from pids.events:max only when pids.max is configured on
non-leaf cgroups.)
Since it changes semantics of the original "max" event, introduce this
change only in the v2 API of the controller and add a cgroup2 mount
option to revert to the legacy behavior.
Signed-off-by: Michal Koutný <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
Since commit 51ffe41178c4 ("cpuset: convert away from cftype->read()"),
cpuset_common_file_read() has been renamed.
Signed-off-by: Xiu Jianfeng <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
The struct cpuset is kzalloc'd, so all the members are zeroed already
and nodes_clear() is not needed here.
No functional changes intended.
Signed-off-by: Xiu Jianfeng <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Ingo Molnar:
- Fix x86 IRQ vector leak caused by a CPU offlining race
- Fix build failure in the riscv-imsic irqchip driver
caused by an API-change semantic conflict
- Fix use-after-free in irq_find_at_or_after()
* tag 'irq-urgent-2024-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/irqdesc: Prevent use-after-free in irq_find_at_or_after()
genirq/cpuhotplug, x86/vector: Prevent vector leak during CPU offline
irqchip/riscv-imsic: Fixup riscv_ipi_set_virq_range() conflict
|
|
get_pid_task() internally already calls rcu_read_lock() and
rcu_read_unlock(), so there is no point in doing this one extra time.
This is a drive-by improvement and has no correctness implications.
Acked-by: Jiri Olsa <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
The current implementation of the PID filtering logic for multi-uprobes in
uprobe_prog_run() is filtering down to the exact *thread*, while the intent
of PID filtering is to filter by *process* instead. The check in
uprobe_prog_run() also differs from the analogous one in
uprobe_multi_link_filter() for some reason. The latter is correct,
checking task->mm, not the task itself.
Fix the check in uprobe_prog_run() to perform the same task->mm check.
While doing this, we also switch the get_pid_task() use to the PIDTYPE_TGID
type of lookup, given the intent is to get a representative task of the
entire process. This doesn't change behavior, but seems more logical: it
would hold the task group leader's task now, not any random thread's task.
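A sketch of the corrected filter in uprobe_prog_run(), mirroring
uprobe_multi_link_filter()'s process-level check:
  if (link->task && current->mm != link->task->mm)
          return 0;       /* different process: filter this hit out */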
Last but not least, given multi-uprobe support is half-broken due to
this PID filtering logic (depending on whether PID filtering is
important or not), we need to make it easy for user-space consumers
(including libbpf) to detect whether the PID filtering logic has
already been fixed.
We do it here by adding an early check on passed pid parameter. If it's
negative (and so has no chance of being a valid PID), we return -EINVAL.
Previous behavior would eventually return -ESRCH ("No process found"),
given there can't be any process with negative PID. This subtle change
won't make any practical change in behavior, but will allow applications
to detect PID filtering fixes easily. Libbpf fixes take advantage of
this in the next patch.
Cc: [email protected]
Acked-by: Jiri Olsa <[email protected]>
Fixes: b733eeade420 ("bpf: Add pid filter support for uprobe_multi link")
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more mm updates from Andrew Morton:
"Jeff Xu's implementation of the mseal() syscall"
* tag 'mm-stable-2024-05-24-11-49' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
selftest mm/mseal read-only elf memory segment
mseal: add documentation
selftest mm/mseal memory sealing
mseal: add mseal syscall
mseal: wire up mseal syscall
|
|
Otherwise we can cause spurious EBUSY issues when trying to mount the
rootfs later on.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=218845
Reported-by: Petri Kaukasoina <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
|
|
irq_find_at_or_after() dereferences the interrupt descriptor which is
returned by mt_find() while neither holding sparse_irq_lock nor RCU read
lock, which means the descriptor can be freed between mt_find() and the
dereference:
CPU0                            CPU1
desc = mt_find()
                                delayed_free_desc(desc)
irq_desc_get_irq(desc)
The use-after-free is reported by KASAN:
Call trace:
irq_get_next_irq+0x58/0x84
show_stat+0x638/0x824
seq_read_iter+0x158/0x4ec
proc_reg_read_iter+0x94/0x12c
vfs_read+0x1e0/0x2c8
Freed by task 4471:
slab_free_freelist_hook+0x174/0x1e0
__kmem_cache_free+0xa4/0x1dc
kfree+0x64/0x128
irq_kobj_release+0x28/0x3c
kobject_put+0xcc/0x1e0
delayed_free_desc+0x14/0x2c
rcu_do_batch+0x214/0x720
Guard the access with an RCU read lock section.
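A sketch of the fix, taking the cleanup.h RCU guard around both the maple
tree walk and the dereference:
  unsigned int irq_find_at_or_after(unsigned int offset)
  {
          unsigned long index = offset;
          struct irq_desc *desc;

          guard(rcu)();
          desc = mt_find(&sparse_irqs, &index, nr_irqs);

          return desc ? irq_desc_get_irq(desc) : nr_irqs;
  }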
Fixes: 721255b9826b ("genirq: Use a maple tree for interrupt descriptor management")
Signed-off-by: dicken.ding <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
|
|
Patch series "Introduce mseal", v10.
This patchset proposes a new mseal() syscall for the Linux kernel.
In a nutshell, mseal() protects the VMAs of a given virtual memory range
against modifications, such as changes to their permission bits.
Modern CPUs support memory permissions, such as the read/write (RW) and
no-execute (NX) bits. Linux has supported NX since the release of kernel
version 2.6.8 in August 2004 [1]. The memory permission feature improves
the security stance on memory corruption bugs, as an attacker cannot
simply write to arbitrary memory and point the code to it. The memory
must be marked with the X bit, or else an exception will occur.
Internally, the kernel maintains the memory permissions in a data
structure called VMA (vm_area_struct). mseal() additionally protects the
VMA itself against modifications of the selected seal type.
Memory sealing is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For example,
such an attacker primitive can break control-flow integrity guarantees
since read-only memory that is supposed to be trusted can become writable
or .text pages can get remapped. Memory sealing can automatically be
applied by the runtime loader to seal .text and .rodata pages and
applications can additionally seal security critical data at runtime. A
similar feature already exists in the XNU kernel with the
VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall
[4]. Also, Chrome wants to adopt this feature for their CFI work [2] and
this patchset has been designed to be compatible with the Chrome use case.
Two system calls are involved in sealing the map: mmap() and mseal().
The new mseal() is a syscall on 64-bit CPUs, with the following signature:
int mseal(void *addr, size_t len, unsigned long flags)
addr/len: memory range.
flags: reserved.
mseal() blocks the following operations for the given memory range (see
the sketch after this list).
1> Unmapping, moving to another location, and shrinking the size,
via munmap() and mremap(): these can leave an empty space which can
therefore be replaced with a VMA with a new set of attributes.
2> Moving or expanding a different VMA into the current location,
via mremap().
3> Modifying a VMA via mmap(MAP_FIXED).
4> Size expansion, via mremap(), does not appear to pose any specific
risks to sealed VMAs. It is included anyway because the use case is
unclear. In any case, users can rely on merging to expand a sealed VMA.
5> mprotect() and pkey_mprotect().
6> Some destructive madvise() behaviors (e.g. MADV_DONTNEED) for anonymous
memory, when users don't have write permission to the memory. Those
behaviors can alter region contents by discarding pages, effectively a
memset(0) for anonymous memory.
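A user-space sketch of the effect, assuming headers that define __NR_mseal
(glibc wrappers may not exist yet):
  #include <sys/mman.h>
  #include <sys/syscall.h>
  #include <unistd.h>
  #include <assert.h>
  #include <errno.h>

  int main(void)
  {
          size_t len = 4096;
          void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          assert(syscall(__NR_mseal, p, len, 0) == 0);

          /* both of these are now blocked on the sealed range: */
          assert(mprotect(p, len, PROT_READ) == -1 && errno == EPERM);
          assert(munmap(p, len) == -1 && errno == EPERM);
          return 0;
  }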
The idea that inspired this patch comes from Stephen Röttger’s work in
V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this
API.
Indeed, the Chrome browser has very specific requirements for sealing,
which are distinct from those of most applications. For example, in the
case of libc, sealing is only applied to read-only (RO) or read-execute
(RX) memory segments (such as .text and .RELRO) to prevent them from
becoming writable, the lifetime of those mappings are tied to the lifetime
of the process.
Chrome wants to seal two large address space reservations that are managed
by different allocators. The memory is mapped RW- and RWX respectively
but write access to it is restricted using pkeys (or in the future ARM
permission overlay extensions). The lifetime of those mappings are not
tied to the lifetime of the process, therefore, while the memory is
sealed, the allocators still need to free or discard the unused memory.
For example, with madvise(DONTNEED).
However, always allowing madvise(DONTNEED) on this range poses a security
risk. For example if a jump instruction crosses a page boundary and the
second page gets discarded, it will overwrite the target bytes with zeros
and change the control flow. Checking write-permission before the discard
operation allows us to control when the operation is valid. In this case,
the madvise will only succeed if the executing thread has PKEY write
permissions and PKRU changes are protected in software by control-flow
integrity.
Although the initial version of this patch series is targeting the Chrome
browser as its first user, it became evident during upstream discussions
that we would also want to ensure that the patch set eventually is a
complete solution for memory sealing and compatible with other use cases.
The specific scenario currently in mind is glibc's use case of loading and
sealing ELF executables. To this end, Stephen is working on a change to
glibc to add sealing support to the dynamic linker, which will seal all
non-writable segments at startup. Once this work is completed, all
applications will be able to automatically benefit from these new
protections.
In closing, I would like to formally acknowledge the valuable
contributions received during the RFC process, which were instrumental in
shaping this patch:
Jann Horn: raising awareness and providing valuable insights on the
destructive madvise operations.
Liam R. Howlett: perf optimization.
Linus Torvalds: assisting in defining system call signature and scope.
Theo de Raadt: sharing the experiences and insight gained from
implementing mimmutable() in OpenBSD.
MM perf benchmarks
==================
This patch adds a loop in the mprotect/munmap/madvise(DONTNEED) to
check the VMAs’ sealing flag, so that no partial update can be made,
when any segment within the given memory range is sealed.
To measure the performance impact of this loop, two tests are developed.
[8]
The first is measuring the time taken for a particular system call,
by using clock_gettime(CLOCK_MONOTONIC). The second is using
PERF_COUNT_HW_REF_CPU_CYCLES (exclude user space). Both tests have
similar results.
The tests roughly follow the sequence below:
for (i = 0; i < 1000; i++)
    create 1000 mappings (1 page per VMA)
    start the sampling
    for (j = 0; j < 1000; j++)
        mprotect one mapping
    stop and save the sample
    delete 1000 mappings
calculate all samples.
The tests below were performed on an Intel(R) Pentium(R) Gold 7505 @ 2.00GHz,
4G memory Chromebook.
Based on the latest upstream code:
The first test (measuring time)
syscall__ vmas t t_mseal delta_ns per_vma %
munmap__ 1 909 944 35 35 104%
munmap__ 2 1398 1502 104 52 107%
munmap__ 4 2444 2594 149 37 106%
munmap__ 8 4029 4323 293 37 107%
munmap__ 16 6647 6935 288 18 104%
munmap__ 32 11811 12398 587 18 105%
mprotect 1 439 465 26 26 106%
mprotect 2 1659 1745 86 43 105%
mprotect 4 3747 3889 142 36 104%
mprotect 8 6755 6969 215 27 103%
mprotect 16 13748 14144 396 25 103%
mprotect 32 27827 28969 1142 36 104%
madvise_ 1 240 262 22 22 109%
madvise_ 2 366 442 76 38 121%
madvise_ 4 623 751 128 32 121%
madvise_ 8 1110 1324 215 27 119%
madvise_ 16 2127 2451 324 20 115%
madvise_ 32 4109 4642 534 17 113%
The second test (measuring cpu cycle)
syscall__ vmas cpu cmseal delta_cpu per_vma %
munmap__ 1 1790 1890 100 100 106%
munmap__ 2 2819 3033 214 107 108%
munmap__ 4 4959 5271 312 78 106%
munmap__ 8 8262 8745 483 60 106%
munmap__ 16 13099 14116 1017 64 108%
munmap__ 32 23221 24785 1565 49 107%
mprotect 1 906 967 62 62 107%
mprotect 2 3019 3203 184 92 106%
mprotect 4 6149 6569 420 105 107%
mprotect 8 9978 10524 545 68 105%
mprotect 16 20448 21427 979 61 105%
mprotect 32 40972 42935 1963 61 105%
madvise_ 1 434 497 63 63 115%
madvise_ 2 752 899 147 74 120%
madvise_ 4 1313 1513 200 50 115%
madvise_ 8 2271 2627 356 44 116%
madvise_ 16 4312 4883 571 36 113%
madvise_ 32 8376 9319 943 29 111%
Based on the result, for the 6.8 kernel, the sealing check adds
20-40 nanoseconds, or around 50-100 CPU cycles, per VMA.
In addition, I applied the sealing to the 5.10 kernel:
The first test (measuring time)
syscall__ vmas t tmseal delta_ns per_vma %
munmap__ 1 357 390 33 33 109%
munmap__ 2 442 463 21 11 105%
munmap__ 4 614 634 20 5 103%
munmap__ 8 1017 1137 120 15 112%
munmap__ 16 1889 2153 263 16 114%
munmap__ 32 4109 4088 -21 -1 99%
mprotect 1 235 227 -7 -7 97%
mprotect 2 495 464 -30 -15 94%
mprotect 4 741 764 24 6 103%
mprotect 8 1434 1437 2 0 100%
mprotect 16 2958 2991 33 2 101%
mprotect 32 6431 6608 177 6 103%
madvise_ 1 191 208 16 16 109%
madvise_ 2 300 324 24 12 108%
madvise_ 4 450 473 23 6 105%
madvise_ 8 753 806 53 7 107%
madvise_ 16 1467 1592 125 8 108%
madvise_ 32 2795 3405 610 19 122%
The second test (measuring cpu cycle)
syscall__ nbr_vma cpu cmseal delta_cpu per_vma %
munmap__ 1 684 715 31 31 105%
munmap__ 2 861 898 38 19 104%
munmap__ 4 1183 1235 51 13 104%
munmap__ 8 1999 2045 46 6 102%
munmap__ 16 3839 3816 -23 -1 99%
munmap__ 32 7672 7887 216 7 103%
mprotect 1 397 443 46 46 112%
mprotect 2 738 788 50 25 107%
mprotect 4 1221 1256 35 9 103%
mprotect 8 2356 2429 72 9 103%
mprotect 16 4961 4935 -26 -2 99%
mprotect 32 9882 10172 291 9 103%
madvise_ 1 351 380 29 29 108%
madvise_ 2 565 615 49 25 109%
madvise_ 4 872 933 61 15 107%
madvise_ 8 1508 1640 132 16 109%
madvise_ 16 3078 3323 245 15 108%
madvise_ 32 5893 6704 811 25 114%
For the 5.10 kernel, the sealing check adds 0-15 ns in time, or 10-30
CPU cycles; there is even a decrease in some cases.
It might be interesting to compare the 5.10 and 6.8 kernels:
The first test (measuring time)
syscall__ vmas t_5_10 t_6_8 delta_ns per_vma %
munmap__ 1 357 909 552 552 254%
munmap__ 2 442 1398 956 478 316%
munmap__ 4 614 2444 1830 458 398%
munmap__ 8 1017 4029 3012 377 396%
munmap__ 16 1889 6647 4758 297 352%
munmap__ 32 4109 11811 7702 241 287%
mprotect 1 235 439 204 204 187%
mprotect 2 495 1659 1164 582 335%
mprotect 4 741 3747 3006 752 506%
mprotect 8 1434 6755 5320 665 471%
mprotect 16 2958 13748 10790 674 465%
mprotect 32 6431 27827 21397 669 433%
madvise_ 1 191 240 49 49 125%
madvise_ 2 300 366 67 33 122%
madvise_ 4 450 623 173 43 138%
madvise_ 8 753 1110 357 45 147%
madvise_ 16 1467 2127 660 41 145%
madvise_ 32 2795 4109 1314 41 147%
The second test (measuring cpu cycle)
syscall__ vmas cpu_5_10 c_6_8 delta_cpu per_vma %
munmap__ 1 684 1790 1106 1106 262%
munmap__ 2 861 2819 1958 979 327%
munmap__ 4 1183 4959 3776 944 419%
munmap__ 8 1999 8262 6263 783 413%
munmap__ 16 3839 13099 9260 579 341%
munmap__ 32 7672 23221 15549 486 303%
mprotect 1 397 906 509 509 228%
mprotect 2 738 3019 2281 1140 409%
mprotect 4 1221 6149 4929 1232 504%
mprotect 8 2356 9978 7622 953 423%
mprotect 16 4961 20448 15487 968 412%
mprotect 32 9882 40972 31091 972 415%
madvise_ 1 351 434 82 82 123%
madvise_ 2 565 752 186 93 133%
madvise_ 4 872 1313 442 110 151%
madvise_ 8 1508 2271 763 95 151%
madvise_ 16 3078 4312 1234 77 140%
madvise_ 32 5893 8376 2483 78 142%
From 5.10 to 6.8:
munmap: added 250-550 ns in time, or 500-1100 CPU cycles, per VMA.
mprotect: added 200-750 ns in time, or 500-1200 CPU cycles, per VMA.
madvise: added 33-50 ns in time, or 70-110 CPU cycles, per VMA.
In comparison to mseal, which adds 20-40 ns or 50-100 CPU cycles, the
increase from 5.10 to 6.8 is significantly larger, approximately ten times
greater for munmap and mprotect.
When I discussed the mm performance with Brian Makin, an engineer who has
worked on performance, it was brought to my attention that such performance
benchmarks, which measure millions of mm syscalls in a tight loop, may
not accurately reflect real-world scenarios, such as that of a database
service. Also, this was tested on a single piece of hardware with ChromeOS;
the data from other hardware or distributions might differ. It might be best
to take this data with a grain of salt.
This patch (of 5):
Wire up mseal syscall for all architectures.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jeff Xu <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Jann Horn <[email protected]> [Bug #2]
Cc: Jeff Xu <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Jorge Lucangeli Obes <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Muhammad Usama Anjum <[email protected]>
Cc: Pedro Falcato <[email protected]>
Cc: Stephen Röttger <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Amer Al Shanawany <[email protected]>
Cc: Javier Carrasco <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Recent changes made uprobe_cpu_buffer preparation lazy, and moved it
deeper into __uprobe_trace_func(). This is problematic because
__uprobe_trace_func() is called inside rcu_read_lock()/rcu_read_unlock()
block, which then calls prepare_uprobe_buffer() -> uprobe_buffer_get() ->
mutex_lock(&ucb->mutex), leading to a splat about using mutex under
non-sleepable RCU:
BUG: sleeping function called from invalid context at kernel/locking/mutex.c:585
in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 98231, name: stress-ng-sigq
preempt_count: 0, expected: 0
RCU nest depth: 1, expected: 0
...
Call Trace:
<TASK>
dump_stack_lvl+0x3d/0xe0
__might_resched+0x24c/0x270
? prepare_uprobe_buffer+0xd5/0x1d0
__mutex_lock+0x41/0x820
? ___perf_sw_event+0x206/0x290
? __perf_event_task_sched_in+0x54/0x660
? __perf_event_task_sched_in+0x54/0x660
prepare_uprobe_buffer+0xd5/0x1d0
__uprobe_trace_func+0x4a/0x140
uprobe_dispatcher+0x135/0x280
? uprobe_dispatcher+0x94/0x280
uprobe_notify_resume+0x650/0xec0
? atomic_notifier_call_chain+0x21/0x110
? atomic_notifier_call_chain+0xf8/0x110
irqentry_exit_to_user_mode+0xe2/0x1e0
asm_exc_int3+0x35/0x40
RIP: 0033:0x7f7e1d4da390
Code: 33 04 00 0f 1f 80 00 00 00 00 f3 0f 1e fa b9 01 00 00 00 e9 b2 fc ff ff 66 90 f3 0f 1e fa 31 c9 e9 a5 fc ff ff 0f 1f 44 00 00 <cc> 0f 1e fa b8 27 00 00 00 0f 05 c3 0f 1f 40 00 f3 0f 1e fa b8 6e
RSP: 002b:00007ffd2abc3608 EFLAGS: 00000246
RAX: 0000000000000000 RBX: 0000000076d325f1 RCX: 0000000000000000
RDX: 0000000076d325f1 RSI: 000000000000000a RDI: 00007ffd2abc3690
RBP: 000000000000000a R08: 00017fb700000000 R09: 00017fb700000000
R10: 00017fb700000000 R11: 0000000000000246 R12: 0000000000017ff2
R13: 00007ffd2abc3610 R14: 0000000000000000 R15: 00007ffd2abc3780
</TASK>
Luckily, it's easy to fix by moving prepare_uprobe_buffer() to be called
slightly earlier: into uprobe_trace_func() and uretprobe_trace_func(), outside
of the RCU-locked section. This still keeps the buffer preparation lazy and
helps avoid the overhead when it's not needed. E.g., if there is only a BPF
uprobe handler installed on a given uprobe, the buffer won't be initialized.
Note that the other user of prepare_uprobe_buffer(), __uprobe_perf_func(), is
not affected, as it doesn't prepare the buffer under an RCU read lock.
Link: https://lore.kernel.org/all/[email protected]/
Fixes: 1b8f85defbc8 ("uprobes: prepare uprobe args buffer lazily")
Reported-by: Breno Leitao <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
|
|
The absence of IRQD_MOVE_PCNTXT prevents immediate effectiveness of
interrupt affinity reconfiguration via procfs. Instead, the change is
deferred until the next instance of the interrupt being triggered on the
original CPU.
When the interrupt next triggers on the original CPU, the new affinity is
enforced within __irq_move_irq(). A vector is allocated from the new CPU,
but the old vector on the original CPU remains and is not immediately
reclaimed. Instead, apicd->move_in_progress is flagged, and the reclaiming
process is delayed until the next trigger of the interrupt on the new CPU.
Upon the subsequent triggering of the interrupt on the new CPU,
irq_complete_move() adds a task to the old CPU's vector_cleanup list if it
remains online. Subsequently, the timer on the old CPU iterates over its
vector_cleanup list, reclaiming old vectors.
However, a rare scenario arises if the old CPU is outgoing before the
interrupt triggers again on the new CPU.
In that case irq_force_complete_move() is not invoked on the outgoing CPU
to reclaim the old apicd->prev_vector because the interrupt isn't currently
affine to the outgoing CPU, and irq_needs_fixup() returns false. Even
though __vector_schedule_cleanup() is later called on the new CPU, it
doesn't reclaim apicd->prev_vector; instead, it simply resets both
apicd->move_in_progress and apicd->prev_vector to 0.
As a result, the vector remains unreclaimed in vector_matrix, leading to a
CPU vector leak.
To address this issue, move the invocation of irq_force_complete_move()
before the irq_needs_fixup() call to reclaim apicd->prev_vector, if the
interrupt is currently, or used to be, affine to the outgoing CPU.
Additionally, reclaim the vector in __vector_schedule_cleanup() as well,
following a warning message, although theoretically it should never see
apicd->move_in_progress with apicd->prev_cpu pointing to an offline CPU.
Fixes: f0383c24b485 ("genirq/cpuhotplug: Add support for cleaning up move in progress")
Signed-off-by: Dongli Zhang <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
|