| Age | Commit message (Collapse) | Author | Files | Lines |
|
BPF verifier supports direct memory access for BPF_PROG_TYPE_TRACING type
of bpf programs, e.g., a->b. If "a" is a pointer
pointing to kernel memory, bpf verifier will allow user to write
code in C like a->b and the verifier will translate it to a kernel
load properly. If "a" is a pointer to user memory, it is expected
that bpf developer should be bpf_probe_read_user() helper to
get the value a->b. Without utilizing BTF __user tagging information,
current verifier will assume that a->b is a kernel memory access
and this may generate incorrect result.
Now BTF contains __user information, it can check whether the
pointer points to a user memory or not. If it is, the verifier
can reject the program and force users to use bpf_probe_read_user()
helper explicitly.
In the future, we can easily extend btf_add_space for other
address space tagging, for example, rcu/percpu etc.
Signed-off-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Even though there is a static key protecting from overhead from
cgroup-bpf skb filtering when there is nothing attached, in many cases
it's not enough as registering a filter for one type will ruin the fast
path for all others. It's observed in production servers I've looked
at but also in laptops, where registration is done during init by
systemd or something else.
Add a per-socket fast path check guarding from such overhead. This
affects both receive and transmit paths of TCP, UDP and other
protocols. It showed ~1% tx/s improvement in small payload UDP
send benchmarks using a real NIC and in a server environment and the
number jumps to 2-3% for preemtible kernels.
Reviewed-by: Stanislav Fomichev <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Acked-by: Martin KaFai Lau <[email protected]>
Link: https://lore.kernel.org/r/d8c58857113185a764927a46f4b5a058d36d3ec3.1643292455.git.asml.silence@gmail.com
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
When CONFIG_PROC_FS is disabled psi code generates the following warnings:
kernel/sched/psi.c:1364:30: warning: 'psi_cpu_proc_ops' defined but not used [-Wunused-const-variable=]
1364 | static const struct proc_ops psi_cpu_proc_ops = {
| ^~~~~~~~~~~~~~~~
kernel/sched/psi.c:1355:30: warning: 'psi_memory_proc_ops' defined but not used [-Wunused-const-variable=]
1355 | static const struct proc_ops psi_memory_proc_ops = {
| ^~~~~~~~~~~~~~~~~~~
kernel/sched/psi.c:1346:30: warning: 'psi_io_proc_ops' defined but not used [-Wunused-const-variable=]
1346 | static const struct proc_ops psi_io_proc_ops = {
| ^~~~~~~~~~~~~~~
Make definitions of these structures and related functions conditional on
CONFIG_PROC_FS config.
Fixes: 0e94682b73bf ("psi: introduce psi monitor")
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
iowait_boost signal is applied independently of util and doesn't take
into account uclamp settings of the rq. An io heavy task that is capped
by uclamp_max could still request higher frequency because
sugov_iowait_apply() doesn't clamp the boost via uclamp_rq_util_with()
like effective_cpu_util() does.
Make sure that iowait_boost honours uclamp requests by calling
uclamp_rq_util_with() when applying the boost.
Fixes: 982d9cdc22c9 ("sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks")
Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
sugov_update_single_{freq, perf}() contains a 'busy' filter that ensures
we don't bring the frqeuency down if there's no idle time (CPU is busy).
The problem is that with uclamp_max we will have scenarios where a busy
task is capped to run at a lower frequency and this filter prevents
applying the capping when this task starts running.
We handle this by skipping the filter when uclamp is enabled and the rq
is being capped by uclamp_max.
We introduce a new function uclamp_rq_is_capped() to help detecting when
this capping is taking effect. Some code shuffling was required to allow
using cpu_util_{cfs, rt}() in this new function.
On 2 Core SMT2 Intel laptop I see:
Without this patch:
uclampset -M 0 sysbench --test=cpu --threads = 4 run
produces a score of ~3200 consistently. Which is the highest possible.
Compiling the kernel also results in frequency running at max 3.1GHz all
the time - running uclampset -M 400 to cap it has no effect without this
patch.
With this patch:
uclampset -M 0 sysbench --test=cpu --threads = 4 run
produces a score of ~1100 with some outliers in ~1700. Uclamp max
aggregates the performance requirements, so having high values sometimes
is expected if some other task happens to require that frequency starts
running at the same time.
When compiling the kernel with uclampset -M 400 I can see the
frequencies mostly in the ~2GHz region. Helpful to conserve power and
prevent heating when not plugged in.
Fixes: 982d9cdc22c9 ("sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks")
Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
We can't use this tracepoint in modules without having the symbol
exported first, fix that.
Fixes: 765047932f15 ("sched/pelt: Add support to track thermal pressure")
Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
The child processes will inherit numa_pages_migrated and
total_numa_faults from the parent. It means even if there is no numa
fault happen on the child, the statistics in /proc/$pid of the child
process might show huge amount. This is a bit weird. Let's initialize
them when do fork.
Signed-off-by: Honglei Wang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The older format of /proc/pid/sched printed home node info which
required the mempolicy and task lock around mpol_get(). However
the format has changed since then and there is no need for
sched_show_numa() any more to have mempolicy argument,
asssociated mpol_get/put and task_lock/unlock. Remove them.
Fixes: 397f2378f1361 ("sched/numa: Fix numa balancing stats in /proc/pid/sched")
Signed-off-by: Bharata B Rao <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Srikar Dronamraju <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
When the ucount code was refactored to create get_ucount it was missed
that some of the contexts in which a rlimit is kept elevated can be
the only reference to the user/ucount in the system.
Ordinary ucount references exist in places that also have a reference
to the user namspace, but in POSIX message queues, the SysV shm code,
and the SIGPENDING code there is no independent user namespace
reference.
Inspection of the the user_namespace show no instance of circular
references between struct ucounts and the user_namespace. So
hold a reference from struct ucount to i's user_namespace to
resolve this problem.
Link: https://lore.kernel.org/lkml/[email protected]/
Reported-by: Qian Cai <[email protected]>
Reported-by: Mathias Krause <[email protected]>
Tested-by: Mathias Krause <[email protected]>
Reviewed-by: Mathias Krause <[email protected]>
Reviewed-by: Alexey Gladkov <[email protected]>
Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
Fixes: 6e52a9f0532f ("Reimplement RLIMIT_MSGQUEUE on top of ucounts")
Fixes: d7c9e99aee48 ("Reimplement RLIMIT_MEMLOCK on top of ucounts")
Cc: [email protected]
Signed-off-by: "Eric W. Biederman" <[email protected]>
|
|
The ->percpu_enqueue_shift field is used to map from the running CPU
number to the index of the corresponding callback list. This mapping
can change at runtime in response to varying callback load, resulting
in varying levels of contention on the callback-list locks.
Unfortunately, the initial value of this field is correct only if the
system happens to have a power-of-two number of CPUs, otherwise the
callbacks from the high-numbered CPUs can be placed into the callback list
indexed by 1 (rather than 0), and those index-1 callbacks will be ignored.
This can result in soft lockups and hangs.
This commit therefore corrects this mapping, adding one to this shift
count as needed for systems having odd numbers of CPUs.
Fixes: 7a30871b6a27 ("rcu-tasks: Introduce ->percpu_enqueue_shift for dynamic queue selection")
Reported-by: Andrii Nakryiko <[email protected]>
Cc: Reported-by: Martin Lau <[email protected]>
Cc: Neeraj Upadhyay <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
subparts_cpus should be limited as a subset of cpus_allowed, but it is
updated wrongly by using cpumask_andnot(). Use cpumask_and() instead to
fix it.
Fixes: ee8dde0cd2ce ("cpuset: Add new v2 cpuset.sched.partition flag")
Signed-off-by: Tianchen Ding <[email protected]>
Reviewed-by: Waiman Long <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
Prefer kcalloc() to kzalloc(array_size()) for allocating an array.
Signed-off-by: Robin Murphy <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
SWIOTLB's includes have become a great big mess. Restore some order by
consolidating the random different blocks, sorting alphabetically, and
purging some clearly unnecessary entries - linux/io.h is now included
unconditionally, so need not be duplicated in the restricted DMA pool
case; similarly, linux/io.h subsumes asm/io.h; and by now it's a
mystery why asm/dma.h was ever here at all.
Signed-off-by: Robin Murphy <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
Debugfs functions are already stubbed out for !CONFIG_DEBUG_FS, so we
can remove most of the #ifdefs, just keeping one to manually optimise
away the initcall when it would do nothing. We can also simplify the
code itself by factoring out the directory creation and realising that
the global io_tlb_default_mem now makes debugfs_dir redundant.
Signed-off-by: Robin Murphy <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
For larger TDX VM, memset() after set_memory_decrypted() in
swiotlb_update_mem_attributes() takes substantial portion of boot time.
Zeroing doesn't serve any functional purpose. Malicious VMM can mess
with decrypted/shared buffer at any point.
Remove the memset().
Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Tom Lendacky <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
prb_next_seq() always iterates from the first known sequence number.
In the worst case, it might loop 8k times for 256kB buffer,
15k times for 512kB buffer, and 64k times for 2MB buffer.
It was reported that polling and reading using syslog interface
might occupy 50% of CPU.
Speedup the search by storing @id of the last finalized descriptor.
The loop is still needed because the @id is stored and read in the best
effort way. An atomic variable is used to keep the @id consistent.
But the stores and reads are not serialized against each other.
The descriptor could get reused in the meantime. The related sequence
number will be used only when it is still valid.
An invalid value should be read _only_ when there is a flood of messages
and the ringbuffer is rapidly reused. The performance is the least
problem in this case.
Reported-by: Chunlei Wang <[email protected]>
Signed-off-by: Mukesh Ojha <[email protected]>
Reviewed-by: John Ogness <[email protected]>
Signed-off-by: Petr Mladek <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Link: https://lore.kernel.org/lkml/YXlddJxLh77DKfIO@alley/T/#m43062e8b2a17f8dbc8c6ccdb8851fb0dbaabbb14
|
|
The active cgroup events are managed in the per-cpu cgrp_cpuctx_list.
This list is only accessed from current cpu and not protected by any
locks. But from the commit ef54c1a476ae ("perf: Rework
perf_event_exit_event()"), it's possible to access (actually modify)
the list from another cpu.
In the perf_remove_from_context(), it can remove an event from the
context without an IPI when the context is not active. This is not
safe with cgroup events which can have some active events in the
context even if ctx->is_active is 0 at the moment. The target cpu
might be in the middle of list iteration at the same time.
If the event is enabled when it's about to be closed, it might call
perf_cgroup_event_disable() and list_del() with the cgrp_cpuctx_list
on a different cpu.
This resulted in a crash due to an invalid list pointer access during
the cgroup list traversal on the cpu which the event belongs to.
Let's fallback to IPI to access the cgrp_cpuctx_list from that cpu.
Similarly, perf_install_in_context() should use IPI for the cgroup
events too.
Fixes: ef54c1a476ae ("perf: Rework perf_event_exit_event()")
Signed-off-by: Namhyung Kim <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
When using per-process mode and event inheritance is set to true,
forked processes will create a new perf events via inherit_event() ->
perf_event_alloc(). But these events will not have ring buffers
assigned to them. Any call to wakeup will be dropped if it's called on
an event with no ring buffer assigned because that's the object that
holds the wakeup list.
If the child event is disabled due to a call to
perf_aux_output_begin() or perf_aux_output_end(), the wakeup is
dropped leaving userspace hanging forever on the poll.
Normally the event is explicitly re-enabled by userspace after it
wakes up to read the aux data, but in this case it does not get woken
up so the event remains disabled.
This can be reproduced when using Arm SPE and 'stress' which forks once
before running the workload. By looking at the list of aux buffers read,
it's apparent that they stop after the fork:
perf record -e arm_spe// -vvv -- stress -c 1
With this patch applied they continue to be printed. This behaviour
doesn't happen when using systemwide or per-cpu mode.
Reported-by: Ruben Ayrapetyan <[email protected]>
Signed-off-by: James Clark <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Commit dee872e124e8 ("bpf: Populate kfunc BTF ID sets in struct btf")
breaks loading of some modules when CONFIG_DEBUG_INFO_BTF is not set.
register_btf_kfunc_id_set returns -ENOENT to the callers when
there is no module btf. Let's return 0 (success) instead to let
those modules work in !CONFIG_DEBUG_INFO_BTF cases.
Acked-by: Kumar Kartikeya Dwivedi <[email protected]>
Fixes: dee872e124e8 ("bpf: Populate kfunc BTF ID sets in struct btf")
Signed-off-by: Stanislav Fomichev <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
It seems inc_misses_counter() suffers from same issue fixed in
the commit d979617aa84d ("bpf: Fixes possible race in update_prog_stats()
for 32bit arches"):
As it can run while interrupts are enabled, it could
be re-entered and the u64_stats syncp could be mangled.
Fixes: 9ed9e9ba2337 ("bpf: Count the number of times recursion was prevented")
Signed-off-by: He Fengqing <[email protected]>
Acked-by: John Fastabend <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
It was found that reading /proc/lockdep after a lockdep splat may
potentially cause an access to freed memory if lockdep_unregister_key()
is called after the splat but before access to /proc/lockdep [1]. This
is due to the fact that graph_lock() call in lockdep_unregister_key()
fails after the clearing of debug_locks by the splat process.
After lockdep_unregister_key() is called, the lock_name may be freed
but the corresponding lock_class structure still have a reference to
it. That invalid memory pointer will then be accessed when /proc/lockdep
is read by a user and a use-after-free (UAF) error will be reported if
KASAN is enabled.
To fix this problem, lockdep_unregister_key() is now modified to always
search for a matching key irrespective of the debug_locks state and
zap the corresponding lock class if a matching one is found.
[1] https://lore.kernel.org/lkml/[email protected]/
Fixes: 8b39adbee805 ("locking/lockdep: Make lockdep_unregister_key() honor 'debug_locks' again")
Reported-by: Tetsuo Handa <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
use memset_startat() helper to simplify the code, there is no functional
change in this patch.
Signed-off-by: Xiu Jianfeng <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
The membarrier command MEMBARRIER_CMD_QUERY allows querying the
available membarrier commands. When the membarrier-rseq fence commands
were added, a new MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ_BITMASK was
introduced with the intent to expose them with the MEMBARRIER_CMD_QUERY
command, the but it was never added to MEMBARRIER_CMD_BITMASK.
The membarrier-rseq fence commands are therefore not wired up with the
query command.
Rename MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ_BITMASK to
MEMBARRIER_PRIVATE_EXPEDITED_RSEQ_BITMASK (the bitmask is not a command
per-se), and change the erroneous
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ_BITMASK (which does not
actually exist) to MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ.
Wire up MEMBARRIER_PRIVATE_EXPEDITED_RSEQ_BITMASK in
MEMBARRIER_CMD_BITMASK. Fixing this allows discovering availability of
the membarrier-rseq fence feature.
Fixes: 2a36ab717e8f ("rseq/membarrier: Add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ")
Signed-off-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: <[email protected]> # 5.10+
Link: https://lkml.kernel.org/r/[email protected]
|
|
When an admin enables audit at early boot via the "audit=1" kernel
command line the audit queue behavior is slightly different; the
audit subsystem goes to greater lengths to avoid dropping records,
which unfortunately can result in problems when the audit daemon is
forcibly stopped for an extended period of time.
This patch makes a number of changes designed to improve the audit
queuing behavior so that leaving the audit daemon in a stopped state
for an extended period does not cause a significant impact to the
system.
- kauditd_send_queue() is now limited to looping through the
passed queue only once per call. This not only prevents the
function from looping indefinitely when records are returned
to the current queue, it also allows any recovery handling in
kauditd_thread() to take place when kauditd_send_queue()
returns.
- Transient netlink send errors seen as -EAGAIN now cause the
record to be returned to the retry queue instead of going to
the hold queue. The intention of the hold queue is to store,
perhaps for an extended period of time, the events which led
up to the audit daemon going offline. The retry queue remains
a temporary queue intended to protect against transient issues
between the kernel and the audit daemon.
- The retry queue is now limited by the audit_backlog_limit
setting, the same as the other queues. This allows admins
to bound the size of all of the audit queues on the system.
- kauditd_rehold_skb() now returns records to the end of the
hold queue to ensure ordering is preserved in the face of
recent changes to kauditd_send_queue().
Cc: [email protected]
Fixes: 5b52330bbfe63 ("audit: fix auditd/kernel connection state tracking")
Fixes: f4b3ee3c85551 ("audit: improve robustness of the audit queue handling")
Reported-by: Gaosheng Cui <[email protected]>
Tested-by: Gaosheng Cui <[email protected]>
Reviewed-by: Richard Guy Briggs <[email protected]>
Signed-off-by: Paul Moore <[email protected]>
|
|
It is an unused wrapper forcing kmalloc allocation for registering
nosave regions. Also, rename __register_nosave_region() to
register_nosave_region() now that there is no need for disambiguation.
Signed-off-by: Amadeusz Sławiński <[email protected]>
Reviewed-by: Cezary Rojewski <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
|
|
The buffer handling in pm_show_wakelocks() is tricky, and hopefully
correct. Ensure it really is correct by using sysfs_emit_at() which
handles all of the tricky string handling logic in a PAGE_SIZE buffer
for us automatically as this is a sysfs file being read from.
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Reviewed-by: Lee Jones <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
|
|
The commit 6326948f940d missed renaming of task->current LSM hook in BTF_ID.
Fix it to silence build warning:
WARN: resolve_btfids: unresolved symbol bpf_lsm_task_getsecid_subj
Fixes: 6326948f940d ("lsm: security_task_getsecid_subj() -> security_current_getsecid_subj()")
Acked-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
This adds a helper for bpf programs to read the memory of other
tasks.
As an example use case at Meta, we are using a bpf task iterator program
and this new helper to print C++ async stack traces for all threads of
a given process.
Signed-off-by: Kenny Yu <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
This patch allows bpf iterator programs to use sleepable helpers by
changing `bpf_iter_run_prog` to use the appropriate synchronization.
With sleepable bpf iterator programs, we can no longer use
`rcu_read_lock()` and must use `rcu_read_lock_trace()` instead
to protect the bpf program.
Signed-off-by: Kenny Yu <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Daniel Borkmann says:
====================
pull-request: bpf-next 2022-01-24
We've added 80 non-merge commits during the last 14 day(s) which contain
a total of 128 files changed, 4990 insertions(+), 895 deletions(-).
The main changes are:
1) Add XDP multi-buffer support and implement it for the mvneta driver,
from Lorenzo Bianconi, Eelco Chaudron and Toke Høiland-Jørgensen.
2) Add unstable conntrack lookup helpers for BPF by using the BPF kfunc
infra, from Kumar Kartikeya Dwivedi.
3) Extend BPF cgroup programs to export custom ret value to userspace via
two helpers bpf_get_retval() and bpf_set_retval(), from YiFei Zhu.
4) Add support for AF_UNIX iterator batching, from Kuniyuki Iwashima.
5) Complete missing UAPI BPF helper description and change bpf_doc.py script
to enforce consistent & complete helper documentation, from Usama Arif.
6) Deprecate libbpf's legacy BPF map definitions and streamline XDP APIs to
follow tc-based APIs, from Andrii Nakryiko.
7) Support BPF_PROG_QUERY for BPF programs attached to sockmap, from Di Zhu.
8) Deprecate libbpf's bpf_map__def() API and replace users with proper getters
and setters, from Christy Lee.
9) Extend libbpf's btf__add_btf() with an additional hashmap for strings to
reduce overhead, from Kui-Feng Lee.
10) Fix bpftool and libbpf error handling related to libbpf's hashmap__new()
utility function, from Mauricio Vásquez.
11) Add support to BTF program names in bpftool's program dump, from Raman Shukhau.
12) Fix resolve_btfids build to pick up host flags, from Connor O'Brien.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (80 commits)
selftests, bpf: Do not yet switch to new libbpf XDP APIs
selftests, xsk: Fix rx_full stats test
bpf: Fix flexible_array.cocci warnings
xdp: disable XDP_REDIRECT for xdp frags
bpf: selftests: add CPUMAP/DEVMAP selftests for xdp frags
bpf: selftests: introduce bpf_xdp_{load,store}_bytes selftest
net: xdp: introduce bpf_xdp_pointer utility routine
bpf: generalise tail call map compatibility check
libbpf: Add SEC name for xdp frags programs
bpf: selftests: update xdp_adjust_tail selftest to include xdp frags
bpf: test_run: add xdp_shared_info pointer in bpf_test_finish signature
bpf: introduce frags support to bpf_prog_test_run_xdp()
bpf: move user_size out of bpf_test_init
bpf: add frags support to xdp copy helpers
bpf: add frags support to the bpf_xdp_adjust_tail() API
bpf: introduce bpf_xdp_get_buff_len helper
net: mvneta: enable jumbo frames if the loaded XDP program support frags
bpf: introduce BPF_F_XDP_HAS_FRAGS flag in prog_flags loading the ebpf program
net: mvneta: add frags support to XDP_TX
xdp: add frags support to xdp_return_{buff/frame}
...
====================
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- A series of bpf fixes, including an oops fix and some codegen fixes.
- Fix a regression in syscall_get_arch() for compat processes.
- Fix boot failure on some 32-bit systems with KASAN enabled.
- A couple of other build/minor fixes.
Thanks to Athira Rajeev, Christophe Leroy, Dmitry V. Levin, Jiri Olsa,
Johan Almbladh, Maxime Bizon, Naveen N. Rao, and Nicholas Piggin.
* tag 'powerpc-5.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s: Mask SRR0 before checking against the masked NIP
powerpc/perf: Only define power_pmu_wants_prompt_pmi() for CONFIG_PPC64
powerpc/32s: Fix kasan_init_region() for KASAN
powerpc/time: Fix build failure due to do_hard_irq_enable() on PPC32
powerpc/audit: Fix syscall_get_arch()
powerpc64/bpf: Limit 'ldbrx' to processors compliant with ISA v2.06
tools/bpf: Rename 'struct event' to avoid naming conflict
powerpc/bpf: Update ldimm64 instructions during extra pass
powerpc32/bpf: Fix codegen for bpf-to-bpf calls
bpf: Guard against accessing NULL pt_regs in bpf_get_task_stack()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
"A bunch of fixes: forced idle time accounting, utilization values
propagation in the sched hierarchies and other minor cleanups and
improvements"
* tag 'sched_urgent_for_v5.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
kernel/sched: Remove dl_boosted flag comment
sched: Avoid double preemption in __cond_resched_*lock*()
sched/fair: Fix all kernel-doc warnings
sched/core: Accounting forceidle time for all tasks except idle task
sched/pelt: Relax the sync of load_sum with load_avg
sched/pelt: Relax the sync of runnable_sum with runnable_avg
sched/pelt: Continue to relax the sync of util_sum with util_avg
sched/pelt: Relax the sync of util_sum with util_avg
psi: Fix uaf issue when psi trigger is destroyed while being polled
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Add support for accessing the general purpose counters on Alder Lake
via MMIO
- Add new LBR format v7 support which is v5 modulo TSX
- Fix counter enumeration on Alder Lake hybrids
- Overhaul how context time updates are done and get rid of
perf_event::shadow_ctx_time.
- The usual amount of fixes: event mask correction, supported event
types reporting, etc.
* tag 'perf_urgent_for_v5.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/perf: Avoid warning for Arch LBR without XSAVE
perf/x86/intel/uncore: Add IMC uncore support for ADL
perf/x86/intel/lbr: Add static_branch for LBR INFO flags
perf/x86/intel/lbr: Support LBR format V7
perf/x86/rapl: fix AMD event handling
perf/x86/intel/uncore: Fix CAS_COUNT_WRITE issue for ICX
perf/x86/intel: Add a quirk for the calculation of the number of counters on Alder Lake
perf: Fix perf_event_read_local() time
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull ftrace fix from Steven Rostedt:
"Fix s390 breakage from sorting mcount tables.
The latest merge of the tracing tree sorts the mcount table at build
time. But s390 appears to do things differently (like always) and
replaces the sorted table back to the original unsorted one. As the
ftrace algorithm depends on it being sorted, bad things happen when it
is not, and s390 experienced those bad things.
Add a new config to tell the boot if the mcount table is sorted or
not, and allow s390 to opt out of it"
* tag 'trace-v5.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Fix assuming build time sort works for s390
|
|
To speed up the boot process, as mcount_loc needs to be sorted for ftrace
to work properly, sorting it at build time is more efficient than boot up
and can save milliseconds of time. Unfortunately, this change broke s390
as it will modify the mcount_loc location after the sorting takes place
and will put back the unsorted locations. Since the sorting is skipped at
boot up if it is believed that it was sorted at run time, ftrace can crash
as its algorithms are dependent on the list being sorted.
Add a new config BUILDTIME_MCOUNT_SORT that is set when
BUILDTIME_TABLE_SORT but not if S390 is set. Use this config to determine
if sorting should take place at boot up.
Link: https://lore.kernel.org/all/[email protected]/
Fixes: 72b3942a173c ("scripts: ftrace - move the sort-processing in ftrace_init")
Reported-by: Sven Schnelle <[email protected]>
Tested-by: Heiko Carstens <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
Pull bitmap updates from Yury Norov:
- introduce for_each_set_bitrange()
- use find_first_*_bit() instead of find_next_*_bit() where possible
- unify for_each_bit() macros
* tag 'bitmap-5.17-rc1' of git://github.com/norov/linux:
vsprintf: rework bitmap_list_string
lib: bitmap: add performance test for bitmap_print_to_pagebuf
bitmap: unify find_bit operations
mm/percpu: micro-optimize pcpu_is_populated()
Replace for_each_*_bit_from() with for_each_*_bit() where appropriate
find: micro-optimize for_each_{set,clear}_bit()
include/linux: move for_each_bit() macros from bitops.h to find.h
cpumask: replace cpumask_next_* with cpumask_first_* where appropriate
tools: sync tools/bitmap with mother linux
all: replace find_next{,_zero}_bit with find_first{,_zero}_bit where appropriate
cpumask: use find_first_and_bit()
lib: add find_first_and_bit()
arch: remove GENERIC_FIND_FIRST_BIT entirely
include: move find.h from asm_generic to linux
bitops: move find_bit_*_le functions from le.h to find.h
bitops: protect find_first_{,zero}_bit properly
|
|
Remove PDE_DATA() completely and replace it with pde_data().
[[email protected]: fix naming clash in drivers/nubus/proc.c]
[[email protected]: now fix it properly]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Alexey Gladkov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In preparation for converting bit_spin_lock to rwlock in zsmalloc so
that multiple writers of zspages can run at the same time but those
zspages are supposed to be different zspage instance. Thus, it's not
deadlock. This patch adds write_lock_nested to support the case for
LOCKDEP.
[[email protected]: fix write_lock_nested for RT]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fixup write_lock_nested() implementation]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Sebastian Andrzej Siewior <[email protected]>
Tested-by: Sebastian Andrzej Siewior <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Naresh Kamboju <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
proc_doulongvec_minmax
When we pass a negative value to the proc_doulongvec_minmax() function,
the function returns 0, but the corresponding interface value does not
change.
we can easily reproduce this problem with the following commands:
cd /proc/sys/fs/epoll
echo -1 > max_user_watches; echo $?; cat max_user_watches
This function requires a non-negative number to be passed in, so when a
negative number is passed in, -EINVAL is returned.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Baokun Li <[email protected]>
Reported-by: Hulk Robot <[email protected]>
Acked-by: Luis Chamberlain <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The const variable ten_thousand is not used, it is redundant and can be
removed.
Cleans up clang warning:
kernel/sysctl.c:99:18: warning: unused variable 'ten_thousand' [-Wunused-const-variable]
static const int ten_thousand = 10000;
Link: https://lkml.kernel.org/r/[email protected]
Fixes: c26da54dc8ca ("printk: move printk sysctl to printk/sysctl.c")
Signed-off-by: Colin Ian King <[email protected]>
Acked-by: Luis Chamberlain <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
Move sysctl_kprobes_optimization from kernel/sysctl.c to
kernel/kprobes.c. Use register_sysctl() to register the sysctl
interface.
[[email protected]: fix compile issue when CONFIG_OPTPROBES is disabled]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Xiaoming Ni <[email protected]>
Signed-off-by: Luis Chamberlain <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Anil S Keshavamurthy <[email protected]>
Cc: Antti Palosaari <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Iurii Zaikin <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Lukas Middendorf <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Stephen Kitt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This moves the fs/coredump.c respective sysctls to its own file.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Xiaoming Ni <[email protected]>
Signed-off-by: Luis Chamberlain <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Anil S Keshavamurthy <[email protected]>
Cc: Antti Palosaari <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Iurii Zaikin <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Lukas Middendorf <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Stephen Kitt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
build warning when CONFIG_PRINTK=n
kernel/printk/printk.c:175:5: warning: no previous prototype for
'devkmsg_sysctl_set_loglvl' [-Wmissing-prototypes]
devkmsg_sysctl_set_loglvl() is only used in sysctl.c when
CONFIG_PRINTK=y, but it participates in the build when CONFIG_PRINTK=n.
So add compile dependency CONFIG_PRINTK=y && CONFIG_SYSCTL=y to fix the
build warning.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Xiaoming Ni <[email protected]>
Signed-off-by: Luis Chamberlain <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Anil S Keshavamurthy <[email protected]>
Cc: Antti Palosaari <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Iurii Zaikin <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Lukas Middendorf <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Stephen Kitt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Rename sysctl_init() to sysctl_init_bases() so to reflect exactly what
this is doing.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Luis Chamberlain <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Anil S Keshavamurthy <[email protected]>
Cc: Antti Palosaari <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Iurii Zaikin <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Lukas Middendorf <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Stephen Kitt <[email protected]>
Cc: Xiaoming Ni <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This moves the namespace sysctls to its own file as part of the
kernel/sysctl.c spring cleaning
Since we have now removed all sysctls for "fs", we now have to declare
it on the filesystem code, we do that using the new helper, which
reduces boiler plate code.
We rename init_fs_shared_sysctls() to init_fs_sysctls() to reflect that
now fs/sysctls.c is taking on the burden of being the first to register
the base directory as well.
Lastly, since init code will load in the order in which we link it we
have to move the sysctl code to be linked in early, so that its early
init routine runs prior to other fs code. This way, other filesystem
code can register their own sysctls using the helpers after this:
* register_sysctl_init()
* register_sysctl()
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Luis Chamberlain <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Anil S Keshavamurthy <[email protected]>
Cc: Antti Palosaari <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Iurii Zaikin <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Lukas Middendorf <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Stephen Kitt <[email protected]>
Cc: Xiaoming Ni <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "sysctl: add and use base directory declarer and
registration helper".
In this patch series we start addressing base directories, and so we
start with the "fs" sysctls. The end goal is we end up completely
moving all "fs" sysctl knobs out from kernel/sysctl.
This patch (of 6):
Add a set of helpers which can be used to declare and register base
directory sysctls on their own. We do this so we can later move each of
the base sysctl directories like "fs", "kernel", etc, to their own
respective files instead of shoving the declarations and registrations
all on kernel/sysctl.c. The lazy approach has caught up and with this,
we just end up extending the list of base directories / sysctls on one
file and this makes maintenance difficult due to merge conflicts from
many developers.
The declarations are used first by kernel/sysctl.c for registration its
own base which over time we'll try to clean up. It will be used in the
next patch to demonstrate how to cleanly deal with base sysctl
directories.
[[email protected]: null-terminate the ctl_table arrays]
Link: https://lkml.kernel.org/r/YafJY3rXDYnjK/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Luis Chamberlain <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Iurii Zaikin <[email protected]>
Cc: Xiaoming Ni <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Stephen Kitt <[email protected]>
Cc: Lukas Middendorf <[email protected]>
Cc: Antti Palosaari <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Anil S Keshavamurthy <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
So move the pipe sysctls to its own file.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Luis Chamberlain <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Antti Palosaari <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Iurii Zaikin <[email protected]>
Cc: "J. Bruce Fields" <[email protected]>
Cc: Jeff Layton <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Lukas Middendorf <[email protected]>
Cc: Stephen Kitt <[email protected]>
Cc: Xiaoming Ni <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
So move the fs/exec.c respective sysctls to its own file.
Since checkpatch complains about style issues with the old code, this
move also fixes a few of those minor style issues:
* Use pr_warn() instead of prink(WARNING
* New empty lines are wanted at the beginning of routines
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Luis Chamberlain <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Antti Palosaari <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Iurii Zaikin <[email protected]>
Cc: "J. Bruce Fields" <[email protected]>
Cc: Jeff Layton <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Lukas Middendorf <[email protected]>
Cc: Stephen Kitt <[email protected]>
Cc: Xiaoming Ni <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
So move namei's own sysctl knobs to its own file.
Other than the move we also avoid initializing two static variables to 0
as this is not needed:
* sysctl_protected_symlinks
* sysctl_protected_hardlinks
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Luis Chamberlain <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Antti Palosaari <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Iurii Zaikin <[email protected]>
Cc: "J. Bruce Fields" <[email protected]>
Cc: Jeff Layton <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Lukas Middendorf <[email protected]>
Cc: Stephen Kitt <[email protected]>
Cc: Xiaoming Ni <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
The locking fs sysctls are only used on fs/locks.c, so move them there.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Luis Chamberlain <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Antti Palosaari <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Iurii Zaikin <[email protected]>
Cc: "J. Bruce Fields" <[email protected]>
Cc: Jeff Layton <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Lukas Middendorf <[email protected]>
Cc: Stephen Kitt <[email protected]>
Cc: Xiaoming Ni <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|