Age | Commit message (Collapse) | Author | Files | Lines |
|
kzalloc virtual addresses do not work with SetPageReserved, use the actual
page virtual addresses instead via alloc_pages.
The issue is reported when booting with user_events and
DEBUG_VM_PGFLAGS=y.
Also make the number of events based on the ORDER.
Link: https://lore.kernel.org/all/CADYN=9+xY5Vku3Ws5E9S60SM5dCFfeGeRBkmDFbcxX0ZMoFing@mail.gmail.com/
Link: https://lore.kernel.org/all/[email protected]/
Cc: Beau Belgrave <[email protected]>
Reported-by: Anders Roxell <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
Merge misc fixes from David Howells:
"A set of patches for watch_queue filter issues noted by Jann. I've
added in a cleanup patch from Christophe Jaillet to convert to using
formal bitmap specifiers for the note allocation bitmap.
Also two filesystem fixes (afs and cachefiles)"
* emailed patches from David Howells <[email protected]>:
cachefiles: Fix volume coherency attribute
afs: Fix potential thrashing in afs writeback
watch_queue: Make comment about setting ->defunct more accurate
watch_queue: Fix lack of barrier/sync/lock between post and read
watch_queue: Free the alloc bitmap when the watch_queue is torn down
watch_queue: Fix the alloc bitmap size to reflect notes allocated
watch_queue: Use the bitmap API when applicable
watch_queue: Fix to always request a pow-of-2 pipe ring size
watch_queue: Fix to release page in ->release()
watch_queue, pipe: Free watchqueue state after clearing pipe ring
watch_queue: Fix filter limit check
|
|
watch_queue_clear() has a comment stating that setting ->defunct to true
preventing new additions as well as preventing notifications. Whilst
the latter is true, the first bit is superfluous since at the time this
function is called, the pipe cannot be accessed to add new event
sources.
Remove the "new additions" bit from the comment.
Fixes: c73be61cede5 ("pipe: Add general notification queue support")
Reported-by: Jann Horn <[email protected]>
Signed-off-by: David Howells <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There's nothing to synchronise post_one_notification() versus
pipe_read(). Whilst posting is done under pipe->rd_wait.lock, the
reader only takes pipe->mutex which cannot bar notification posting as
that may need to be made from contexts that cannot sleep.
Fix this by setting pipe->head with a barrier in post_one_notification()
and reading pipe->head with a barrier in pipe_read().
If that's not sufficient, the rd_wait.lock will need to be taken,
possibly in a ->confirm() op so that it only applies to notifications.
The lock would, however, have to be dropped before copy_page_to_iter()
is invoked.
Fixes: c73be61cede5 ("pipe: Add general notification queue support")
Reported-by: Jann Horn <[email protected]>
Signed-off-by: David Howells <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Free the watch_queue note allocation bitmap when the watch_queue is
destroyed.
Fixes: c73be61cede5 ("pipe: Add general notification queue support")
Reported-by: Jann Horn <[email protected]>
Signed-off-by: David Howells <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently, watch_queue_set_size() sets the number of notes available in
wqueue->nr_notes according to the number of notes allocated, but sets
the size of the bitmap to the unrounded number of notes originally asked
for.
Fix this by setting the bitmap size to the number of notes we're
actually going to make available (ie. the number allocated).
Fixes: c73be61cede5 ("pipe: Add general notification queue support")
Reported-by: Jann Horn <[email protected]>
Signed-off-by: David Howells <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use bitmap_alloc() to simplify code, improve the semantic and reduce
some open-coded arithmetic in allocator arguments.
Also change a memset(0xff) into an equivalent bitmap_fill() to keep
consistency.
Signed-off-by: Christophe JAILLET <[email protected]>
Signed-off-by: David Howells <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The pipe ring size must always be a power of 2 as the head and tail
pointers are masked off by AND'ing with the size of the ring - 1.
watch_queue_set_size(), however, lets you specify any number of notes
between 1 and 511. This number is passed through to pipe_resize_ring()
without checking/forcing its alignment.
Fix this by rounding the number of slots required up to the nearest
power of two. The request is meant to guarantee that at least that many
notifications can be generated before the queue is full, so rounding
down isn't an option, but, alternatively, it may be better to give an
error if we aren't allowed to allocate that much ring space.
Fixes: c73be61cede5 ("pipe: Add general notification queue support")
Reported-by: Jann Horn <[email protected]>
Signed-off-by: David Howells <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When a pipe ring descriptor points to a notification message, the
refcount on the backing page is incremented by the generic get function,
but the release function, which marks the bitmap, doesn't drop the page
ref.
Fix this by calling generic_pipe_buf_release() at the end of
watch_queue_pipe_buf_release().
Fixes: c73be61cede5 ("pipe: Add general notification queue support")
Reported-by: Jann Horn <[email protected]>
Signed-off-by: David Howells <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In watch_queue_set_filter(), there are a couple of places where we check
that the filter type value does not exceed what the type_filter bitmap
can hold. One place calculates the number of bits by:
if (tf[i].type >= sizeof(wfilter->type_filter) * 8)
which is fine, but the second does:
if (tf[i].type >= sizeof(wfilter->type_filter) * BITS_PER_LONG)
which is not. This can lead to a couple of out-of-bounds writes due to
a too-large type:
(1) __set_bit() on wfilter->type_filter
(2) Writing more elements in wfilter->filters[] than we allocated.
Fix this by just using the proper WATCH_TYPE__NR instead, which is the
number of types we actually know about.
The bug may cause an oops looking something like:
BUG: KASAN: slab-out-of-bounds in watch_queue_set_filter+0x659/0x740
Write of size 4 at addr ffff88800d2c66bc by task watch_queue_oob/611
...
Call Trace:
<TASK>
dump_stack_lvl+0x45/0x59
print_address_description.constprop.0+0x1f/0x150
...
kasan_report.cold+0x7f/0x11b
...
watch_queue_set_filter+0x659/0x740
...
__x64_sys_ioctl+0x127/0x190
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
Allocated by task 611:
kasan_save_stack+0x1e/0x40
__kasan_kmalloc+0x81/0xa0
watch_queue_set_filter+0x23a/0x740
__x64_sys_ioctl+0x127/0x190
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
The buggy address belongs to the object at ffff88800d2c66a0
which belongs to the cache kmalloc-32 of size 32
The buggy address is located 28 bytes inside of
32-byte region [ffff88800d2c66a0, ffff88800d2c66c0)
Fixes: c73be61cede5 ("pipe: Add general notification queue support")
Reported-by: Jann Horn <[email protected]>
Signed-off-by: David Howells <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add ftrace_boot_snapshot kernel parameter that will take a snapshot at the
end of boot up just before switching over to user space (it happens during
the kernel freeing of init memory).
This is useful when there's interesting data that can be collected from
kernel start up, but gets overridden by user space start up code. With
this option, the ring buffer content from the boot up traces gets saved in
the snapshot at the end of boot up. This trace can be read from:
/sys/kernel/tracing/snapshot
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
The macro TRACE_DEFINE_ENUM is used to convert enums in the kernel to
their actual value when they are exported to user space via the trace
event format file.
Currently only the enums in the "print fmt" (TP_printk in the TRACE_EVENT
macro) have the enums converted. But the enums can be used to denote array
size:
field:unsigned int fc_ineligible_rc[EXT4_FC_REASON_MAX]; offset:12; size:36; signed:0;
The EXT4_FC_REASON_MAX has no meaning to userspace but it needs to know
that information to know how to parse the array.
Have the array indexes also be parsed as well.
Link: https://lore.kernel.org/all/[email protected]/
Reported-by: Ritesh Harjani <[email protected]>
Tested-by: Ritesh Harjani <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
0-day reported the strncpy error below:
../kernel/trace/trace_events_synth.c: In function 'last_cmd_set':
../kernel/trace/trace_events_synth.c:65:9: warning: 'strncpy' specified bound depends on the length o\
f the source argument [-Wstringop-truncation]
65 | strncpy(last_cmd, str, strlen(str) + 1);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../kernel/trace/trace_events_synth.c:65:32: note: length computed here
65 | strncpy(last_cmd, str, strlen(str) + 1);
| ^~~~~~~~~~~
There's no reason to use strncpy here, in fact there's no reason to do
anything but a simple kstrdup() (note we don't even need to check for
failure since last_cmod is expected to be either the last cmd string
or NULL, and the containing function is a void return).
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 27c888da9867 ("tracing: Remove size restriction on synthetic event cmd error logging")
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Tom Zanussi <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
Find user_events always while under the event_mutex and before leaving
the lock, add a ref count to the user_event. This ensures that all paths
under the event_mutex that check the ref counts will be synchronized.
The ioctl add/delete paths are protected by the reg_mutex. However,
dyn_event is only protected by the event_mutex. The dyn_event delete
path cannot acquire reg_mutex, since that could cause a deadlock between
the ioctl delete case acquiring event_mutex after acquiring the reg_mutex.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Beau Belgrave <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
Make bpf_lsm_kernel_read_file() as sleepable, so that bpf_ima_inode_hash()
or bpf_ima_file_hash() can be called inside the implementation of this
hook.
Signed-off-by: Roberto Sassu <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
ima_file_hash() has been modified to calculate the measurement of a file on
demand, if it has not been already performed by IMA or the measurement is
not fresh. For compatibility reasons, ima_inode_hash() remains unchanged.
Keep the same approach in eBPF and introduce the new helper
bpf_ima_file_hash() to take advantage of the modified behavior of
ima_file_hash().
Signed-off-by: Roberto Sassu <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Minor tracing fixes:
- Fix unregistering the same event twice. A user could disable the
same event that osnoise will disable on unregistering.
- Inform RCU of a quiescent state in the osnoise testing thread.
- Fix some kerneldoc comments"
* tag 'trace-v5.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Fix some W=1 warnings in kernel doc comments
tracing/osnoise: Force quiescent states while tracing
tracing/osnoise: Do not unregister events twice
|
|
net/dsa/dsa2.c
commit afb3cc1a397d ("net: dsa: unlock the rtnl_mutex when dsa_master_setup() fails")
commit e83d56537859 ("net: dsa: replay master state events in dsa_tree_{setup,teardown}_master")
https://lore.kernel.org/all/[email protected]/
drivers/net/ethernet/intel/ice/ice.h
commit 97b0129146b1 ("ice: Fix error with handling of bonding MTU")
commit 43113ff73453 ("ice: add TTY for GNSS module for E810T device")
https://lore.kernel.org/all/[email protected]/
drivers/staging/gdm724x/gdm_lte.c
commit fc7f750dc9d1 ("staging: gdm724x: fix use after free in gdm_lte_rx()")
commit 4bcc4249b4cf ("staging: Use netif_rx().")
https://lore.kernel.org/all/[email protected]/
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Allow custom events to be added to the events directory in the tracefs
file system. For example, a module could be installed that attaches to an
event and wants to be enabled and disabled via the tracefs file system. It
would use trace_add_event_call() to add the event to the tracefs
directory, and trace_remove_event_call() to remove it.
Make both those functions EXPORT_SYMBOL_GPL().
Link: https://lkml.kernel.org/r/[email protected]
Cc: Ingo Molnar <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Tom Zanussi <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
Using strnlen(dest, str, n) is confusing, as the size of dest must be
strlen(dest) + n + 1. Even more confusing, using sizeof(string constant)
gives you strlen(string constant) + 1 and not just strlen().
These two together made using strncat() with a constant string a bit off
in the calculations as we have:
len = sizeof(HIST_PREFIX) + strlen(str) + 1;
kfree(last_cmd);
last_cmd = kzalloc(len, GFP_KERNEL);
strcpy(last_cmd, HIST_PREFIX);
len -= sizeof(HIST_PREFIX) + 1;
strncat(last_cmd, str, len);
The above works if we s/sizeof/strlen/ with HIST_PREFIX (which is defined
as "hist:", but because sizeof(HIST_PREFIX) is equal to
strlen(HIST_PREFIX) + 1, we can drop the +1 in the code. But at least
comment that we are doing so.
Link: https://lore.kernel.org/all/[email protected]/
Fixes: 9f8e5aee93ed2 ("tracing: Fix allocation of last_cmd in last_cmd_set()")
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
Ensure name is initialized by default to NULL to prevent possible edge
cases that could lead to it being left uninitialized. Add an explicit
check for NULL name to ensure edge boundaries.
Link: https://lore.kernel.org/bpf/20220224105334.GA2248@kili/
Link: https://lore.kernel.org/linux-trace-devel/[email protected]
Signed-off-by: Beau Belgrave <[email protected]>
Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
Use offsetofend() instead of offsetof() + sizeof() to simplify
MIN_BPF_LINEINFO_SIZE macro definition.
Signed-off-by: Yuntao Wang <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Acked-by: Joanne Koong <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
This adds support for running XDP programs through BPF_PROG_RUN in a mode
that enables live packet processing of the resulting frames. Previous uses
of BPF_PROG_RUN for XDP returned the XDP program return code and the
modified packet data to userspace, which is useful for unit testing of XDP
programs.
The existing BPF_PROG_RUN for XDP allows userspace to set the ingress
ifindex and RXQ number as part of the context object being passed to the
kernel. This patch reuses that code, but adds a new mode with different
semantics, which can be selected with the new BPF_F_TEST_XDP_LIVE_FRAMES
flag.
When running BPF_PROG_RUN in this mode, the XDP program return codes will
be honoured: returning XDP_PASS will result in the frame being injected
into the networking stack as if it came from the selected networking
interface, while returning XDP_TX and XDP_REDIRECT will result in the frame
being transmitted out that interface. XDP_TX is translated into an
XDP_REDIRECT operation to the same interface, since the real XDP_TX action
is only possible from within the network drivers themselves, not from the
process context where BPF_PROG_RUN is executed.
Internally, this new mode of operation creates a page pool instance while
setting up the test run, and feeds pages from that into the XDP program.
The setup cost of this is amortised over the number of repetitions
specified by userspace.
To support the performance testing use case, we further optimise the setup
step so that all pages in the pool are pre-initialised with the packet
data, and pre-computed context and xdp_frame objects stored at the start of
each page. This makes it possible to entirely avoid touching the page
content on each XDP program invocation, and enables sending up to 9
Mpps/core on my test box.
Because the data pages are recycled by the page pool, and the test runner
doesn't re-initialise them for each run, subsequent invocations of the XDP
program will see the packet data in the state it was after the last time it
ran on that particular page. This means that an XDP program that modifies
the packet before redirecting it has to be careful about which assumptions
it makes about the packet content, but that is only an issue for the most
naively written programs.
Enabling the new flag is only allowed when not setting ctx_out and data_out
in the test specification, since using it means frames will be redirected
somewhere else, so they can't be returned.
Signed-off-by: Toke Høiland-Jørgensen <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Martin KaFai Lau <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
There are a few places where we test the current process' capability set
to decide if we're going to be more or less generous with resource
acquisition for a system call. If the process doesn't have the
capability, we can continue the call, albeit in a degraded mode.
These are /not/ the actual security decisions, so it's not proper to use
capable(), which (in certain selinux setups) causes audit messages to
get logged. Switch them to has_capability_noaudit.
Fixes: 7317a03df703f ("xfs: refactor inode ownership change transaction/inode/quota allocation idiom")
Fixes: ea9a46e1c4925 ("xfs: only return detailed fsmap info if the caller has CAP_SYS_ADMIN")
Signed-off-by: Darrick J. Wong <[email protected]>
Cc: Dave Chinner <[email protected]>
Reviewed-by: Ondrej Mosnacek <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Reviewed-by: Eric Sandeen <[email protected]>
|
|
Clean up the following clang-w1 warning:
kernel/trace/ftrace.c:7827: warning: Function parameter or member 'ops'
not described in 'unregister_ftrace_function'.
kernel/trace/ftrace.c:7805: warning: Function parameter or member 'ops'
not described in 'register_ftrace_function'.
Link: https://lkml.kernel.org/r/[email protected]
Reported-by: Abaci Robot <[email protected]>
Signed-off-by: Jiapeng Chong <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
At the moment running osnoise on a nohz_full CPU or uncontested FIFO
priority and a PREEMPT_RCU kernel might have the side effect of
extending grace periods too much. This will entice RCU to force a
context switch on the wayward CPU to end the grace period, all while
introducing unwarranted noise into the tracer. This behaviour is
unavoidable as overly extending grace periods might exhaust the system's
memory.
This same exact problem is what extended quiescent states (EQS) were
created for, conversely, rcu_momentary_dyntick_idle() emulates them by
performing a zero duration EQS. So let's make use of it.
In the common case rcu_momentary_dyntick_idle() is fairly inexpensive:
atomically incrementing a local per-CPU counter and doing a store. So it
shouldn't affect osnoise's measurements (which has a 1us granularity),
so we'll call it unanimously.
The uncommon case involve calling rcu_momentary_dyntick_idle() after
having the osnoise process:
- Receive an expedited quiescent state IPI with preemption disabled or
during an RCU critical section. (activates rdp->cpu_no_qs.b.exp
code-path).
- Being preempted within in an RCU critical section and having the
subsequent outermost rcu_read_unlock() called with interrupts
disabled. (t->rcu_read_unlock_special.b.blocked code-path).
Neither of those are possible at the moment, and are unlikely to be in
the future given the osnoise's loop design. On top of this, the noise
generated by the situations described above is unavoidable, and if not
exposed by rcu_momentary_dyntick_idle() will be eventually seen in
subsequent rcu_read_unlock() calls or schedule operations.
Link: https://lkml.kernel.org/r/[email protected]
Cc: [email protected]
Fixes: bce29ac9ce0b ("trace: Add osnoise tracer")
Signed-off-by: Nicolas Saenz Julienne <[email protected]>
Acked-by: Paul E. McKenney <[email protected]>
Acked-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
Nicolas reported that using:
# trace-cmd record -e all -M 10 -p osnoise --poll
Resulted in the following kernel warning:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1217 at kernel/tracepoint.c:404 tracepoint_probe_unregister+0x280/0x370
[...]
CPU: 0 PID: 1217 Comm: trace-cmd Not tainted 5.17.0-rc6-next-20220307-nico+ #19
RIP: 0010:tracepoint_probe_unregister+0x280/0x370
[...]
CR2: 00007ff919b29497 CR3: 0000000109da4005 CR4: 0000000000170ef0
Call Trace:
<TASK>
osnoise_workload_stop+0x36/0x90
tracing_set_tracer+0x108/0x260
tracing_set_trace_write+0x94/0xd0
? __check_object_size.part.0+0x10a/0x150
? selinux_file_permission+0x104/0x150
vfs_write+0xb5/0x290
ksys_write+0x5f/0xe0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7ff919a18127
[...]
---[ end trace 0000000000000000 ]---
The warning complains about an attempt to unregister an
unregistered tracepoint.
This happens on trace-cmd because it first stops tracing, and
then switches the tracer to nop. Which is equivalent to:
# cd /sys/kernel/tracing/
# echo osnoise > current_tracer
# echo 0 > tracing_on
# echo nop > current_tracer
The osnoise tracer stops the workload when no trace instance
is actually collecting data. This can be caused both by
disabling tracing or disabling the tracer itself.
To avoid unregistering events twice, use the existing
trace_osnoise_callback_enabled variable to check if the events
(and the workload) are actually active before trying to
deactivate them.
Link: https://lore.kernel.org/all/[email protected]/
Link: https://lkml.kernel.org/r/938765e17d5a781c2df429a98f0b2e7cc317b022.1646823913.git.bristot@kernel.org
Cc: [email protected]
Cc: Marcelo Tosatti <[email protected]>
Fixes: 2fac8d6486d5 ("tracing/osnoise: Allow multiple instances of the same tracer")
Reported-by: Nicolas Saenz Julienne <[email protected]>
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
Unnecessarily grabbing the tasklist_lock can be a scalability bottleneck
for workloads that also must grab the tasklist_lock for waiting,
killing, and cloning.
The tasklist_lock was grabbed to protect tsk->sighand from disappearing
(becoming NULL). tsk->signal was already protected by holding a
reference to tsk.
update_rlimit_cpu() assumed tsk->sighand != NULL. With this commit, it
attempts to lock_task_sighand(). However, this means that
update_rlimit_cpu() can fail. This only happens when a task is exiting.
Note that during exec, sighand may *change*, but it will not be NULL.
Prior to this commit, the do_prlimit() ensured that update_rlimit_cpu()
would not fail by read locking the tasklist_lock and checking tsk->sighand
!= NULL.
If update_rlimit_cpu() fails, there may be other tasks that are not
exiting that share tsk->signal. However, the group_leader is the last
task to be released, so if we cannot update_rlimit_cpu(group_leader),
then the entire process is exiting.
The only other caller of update_rlimit_cpu() is
selinux_bprm_committing_creds(). It has tsk == current, so
update_rlimit_cpu() cannot fail (current->sighand cannot disappear
until current exits).
This change resulted in a 14% speedup on a microbenchmark where parents
kill and wait on their children, and children getpriority, setpriority,
and getrlimit.
Signed-off-by: Barret Rhoden <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Eric W. Biederman <[email protected]>
|
|
There are no other callers in the kernel.
Fixed up a comment format and whitespace issue when moving do_prlimit()
higher in sys.c.
Signed-off-by: Barret Rhoden <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Eric W. Biederman <[email protected]>
|
|
build_sched_domains
While investigating the sparse warning reported by the LKP bot [1],
observed that we have a redundant variable "top" in the function
build_sched_domains that was introduced in the recent commit
e496132ebedd ("sched/fair: Adjust the allowed NUMA imbalance when
SD_NUMA spans multiple LLCs")
The existing variable "sd" suffices which allows us to remove the
redundant variable "top" while annotating the other variable "top_p"
with the "__rcu" annotation to silence the sparse warning.
[1] https://lore.kernel.org/lkml/[email protected]/
Fixes: e496132ebedd ("sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs")
Reported-by: kernel test robot <[email protected]>
Signed-off-by: K Prateek Nayak <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The `struct rq *rq` parameter isn't used. Remove it.
Signed-off-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Juri Lelli <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The need_pull_[rt|dl]_task() and pull_[rt|dl]_task() functions are not
used on a !CONFIG_SMP system. Remove them.
Signed-off-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Juri Lelli <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Deploy __node_2_pdl(node), __node_2_dle(node) and rb_first_cached()
consistently throughout the sched class source file which makes the
code at least easier to read.
Signed-off-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Juri Lelli <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Both functions are doing almost the same, that is checking if admission
control is still respected.
With exclusive cpusets, dl_task_can_attach() checks if the destination
cpuset (i.e. its root domain) has enough CPU capacity to accommodate the
task.
dl_cpu_busy() checks if there is enough CPU capacity in the cpuset in
case the CPU is hot-plugged out.
dl_task_can_attach() is used to check if a task can be admitted while
dl_cpu_busy() is used to check if a CPU can be hotplugged out.
Make dl_cpu_busy() able to deal with a task and use it instead of
dl_task_can_attach() in task_can_attach().
Signed-off-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Juri Lelli <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
source file
Move the deadline bandwidth management (admission control) functions
__dl_add(), __dl_sub() and __dl_overflow() as well as the bandwidth
reclaim function __dl_update() from private task scheduler header file
to the deadline sched class source file.
The functions are only used internally so they don't have to be
exported.
Signed-off-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Juri Lelli <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Since commit 1724813d9f2c ("sched/deadline: Remove the sysctl_sched_dl
knobs") the default deadline bandwidth control structure has no purpose.
Remove it.
Signed-off-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Juri Lelli <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Instead of determining buf_info string in the caller of check_buffer_access(),
we can determine whether the register type is read-only through
type_is_rdonly_mem() helper inside check_buffer_access() and construct
buf_info, making the code slightly cleaner.
Signed-off-by: Shung-Hsi Yu <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/bpf/YiWYLnAkEZXBP/gH@syu-laptop
|
|
The trailing slash of LIBBPF_SRCS is redundant, remove it. Also inline
it as its only used in LIBBPF_INCLUDE.
Signed-off-by: Yuntao Wang <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Using strncpy() on NUL-terminated strings is considered deprecated[1].
Moreover, if the length of 'task->comm' is less than the destination buffer
size, strncpy() will NUL-pad the destination buffer, which is a needless
performance penalty.
Replacing strncpy() with strscpy() fixes all these issues.
[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings
Signed-off-by: Yuntao Wang <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 spectre fixes from Borislav Petkov:
- Mitigate Spectre v2-type Branch History Buffer attacks on machines
which support eIBRS, i.e., the hardware-assisted speculation
restriction after it has been shown that such machines are vulnerable
even with the hardware mitigation.
- Do not use the default LFENCE-based Spectre v2 mitigation on AMD as
it is insufficient to mitigate such attacks. Instead, switch to
retpolines on all AMD by default.
- Update the docs and add some warnings for the obviously vulnerable
cmdline configurations.
* tag 'x86_bugs_for_v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/speculation: Warn about eIBRS + LFENCE + Unprivileged eBPF + SMT
x86/speculation: Warn about Spectre v2 LFENCE mitigation
x86/speculation: Update link to AMD speculation whitepaper
x86/speculation: Use generic retpoline by default on AMD
x86/speculation: Include unprivileged eBPF status in Spectre v2 mitigation reporting
Documentation/hw-vuln: Update spectre doc
x86/speculation: Add eIBRS + Retpoline options
x86/speculation: Rename RETPOLINE_AMD to RETPOLINE_LFENCE
|
|
RCU_SOFTIRQ used to be special in that it could be raised on purpose
within the idle path to prevent from stopping the tick. Some code still
prevents from unnecessary warnings related to this specific behaviour
while entering in dynticks-idle mode.
However the nohz layout has changed quite a bit in ten years, and the
removal of CONFIG_RCU_FAST_NO_HZ has been the final straw to this
safe-conduct. Now the RCU_SOFTIRQ vector is expected to be raised from
sane places.
A remaining corner case is admitted though when the vector is invoked
in fragile hotplug path.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Paul Menzel <[email protected]>
|
|
With the removal of CONFIG_RCU_FAST_NO_HZ, the parameters in
rcu_needs_cpu() are not necessary anymore. Simply remove them.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Paul Menzel <[email protected]>
|
|
On some rare cases, the timekeeper CPU may be delaying its jiffies
update duty for a while. Known causes include:
* The timekeeper is waiting on stop_machine in a MULTI_STOP_DISABLE_IRQ
or MULTI_STOP_RUN state. Disabled interrupts prevent from timekeeping
updates while waiting for the target CPU to complete its
stop_machine() callback.
* The timekeeper vcpu has VMEXIT'ed for a long while due to some overload
on the host.
Detect and fix these situations with emergency timekeeping catchups.
Original-patch-by: Paul E. McKenney <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
|
|
Unfortunately, we ended up merging an old version of the patch "fix info
leak with DMA_FROM_DEVICE" instead of merging the latest one. Christoph
(the swiotlb maintainer), he asked me to create an incremental fix
(after I have pointed this out the mix up, and asked him for guidance).
So here we go.
The main differences between what we got and what was agreed are:
* swiotlb_sync_single_for_device is also required to do an extra bounce
* We decided not to introduce DMA_ATTR_OVERWRITE until we have exploiters
* The implantation of DMA_ATTR_OVERWRITE is flawed: DMA_ATTR_OVERWRITE
must take precedence over DMA_ATTR_SKIP_CPU_SYNC
Thus this patch removes DMA_ATTR_OVERWRITE, and makes
swiotlb_sync_single_for_device() bounce unconditionally (that is, also
when dir == DMA_TO_DEVICE) in order do avoid synchronising back stale
data from the swiotlb buffer.
Let me note, that if the size used with dma_sync_* API is less than the
size used with dma_[un]map_*, under certain circumstances we may still
end up with swiotlb not being transparent. In that sense, this is no
perfect fix either.
To get this bullet proof, we would have to bounce the entire
mapping/bounce buffer. For that we would have to figure out the starting
address, and the size of the mapping in
swiotlb_sync_single_for_device(). While this does seem possible, there
seems to be no firm consensus on how things are supposed to work.
Signed-off-by: Halil Pasic <[email protected]>
Fixes: ddbd89deb7d3 ("swiotlb: fix info leak with DMA_FROM_DEVICE")
Cc: [email protected]
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into timers/core
Pull clocksource watchdog update from Paul McKenney:
- Add a config option for the maximum skew of the watchdog.
Link: https://lore.kernel.org/r/20220224000718.GA3747431@paulmck-ThinkPad-P17-Gen-1
|
|
Merge a topic branch we are maintaining with some cross-architecture
changes to function descriptor handling and their use in LKDTM.
From Christophe's cover letter:
Fix LKDTM for PPC64/IA64/PARISC
PPC64/IA64/PARISC have function descriptors. LKDTM doesn't work on those
three architectures because LKDTM messes up function descriptors with
functions.
This series does some cleanup in the three architectures and refactors
function descriptors so that it can then easily use it in a generic way
in LKDTM.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix sorting on old "cpu" value in histograms
- Fix return value of __setup() boot parameter handlers
* tag 'trace-v5.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix return value of __setup handlers
tracing/histogram: Fix sorting on old "cpu" value
|
|
With the introduction of the btf_type_tag "percpu", we can add a
MEM_PERCPU to identify those pointers that point to percpu memory.
The ability of differetiating percpu pointers from regular memory
pointers have two benefits:
1. It forbids unexpected use of percpu pointers, such as direct loads.
In kernel, there are special functions used for accessing percpu
memory. Directly loading percpu memory is meaningless. We already
have BPF helpers like bpf_per_cpu_ptr() and bpf_this_cpu_ptr() that
wrap the kernel percpu functions. So we can now convert percpu
pointers into regular pointers in a safe way.
2. Previously, bpf_per_cpu_ptr() and bpf_this_cpu_ptr() only work on
PTR_TO_PERCPU_BTF_ID, a special reg_type which describes static
percpu variables in kernel (we rely on pahole to encode them into
vmlinux BTF). Now, since we can identify __percpu tagged pointers,
we can also identify dynamically allocated percpu memory as well.
It means we can use bpf_xxx_cpu_ptr() on dynamic percpu memory.
This would be very convenient when accessing fields like
"cgroup->rstat_cpu".
Signed-off-by: Hao Luo <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
With the introduction of MEM_USER in
commit c6f1bfe89ac9 ("bpf: reject program if a __user tagged memory accessed in kernel way")
PTR_TO_BTF_ID can be combined with a MEM_USER tag. Therefore, most
likely, when we compare reg_type against PTR_TO_BTF_ID, we want to use
the reg's base_type. Previously the check in check_mem_access() wants
to say: if the reg is BTF_ID but not NULL, the execution flow falls
into the 'then' branch. But now a reg of (BTF_ID | MEM_USER), which
should go into the 'then' branch, goes into the 'else'.
The end results before and after this patch are the same: regs tagged
with MEM_USER get rejected, but not in a way we intended. So fix the
condition, the error message now is correct.
Before (log from commit 696c39011538):
$ ./test_progs -v -n 22/3
...
libbpf: prog 'test_user1': BPF program load failed: Permission denied
libbpf: prog 'test_user1': -- BEGIN PROG LOAD LOG --
R1 type=ctx expected=fp
0: R1=ctx(id=0,off=0,imm=0) R10=fp0
; int BPF_PROG(test_user1, struct bpf_testmod_btf_type_tag_1 *arg)
0: (79) r1 = *(u64 *)(r1 +0)
func 'bpf_testmod_test_btf_type_tag_user_1' arg0 has btf_id 136561 type STRUCT 'bpf_testmod_btf_type_tag_1'
1: R1_w=user_ptr_bpf_testmod_btf_type_tag_1(id=0,off=0,imm=0)
; g = arg->a;
1: (61) r1 = *(u32 *)(r1 +0)
R1 invalid mem access 'user_ptr_'
Now:
libbpf: prog 'test_user1': BPF program load failed: Permission denied
libbpf: prog 'test_user1': -- BEGIN PROG LOAD LOG --
R1 type=ctx expected=fp
0: R1=ctx(id=0,off=0,imm=0) R10=fp0
; int BPF_PROG(test_user1, struct bpf_testmod_btf_type_tag_1 *arg)
0: (79) r1 = *(u64 *)(r1 +0)
func 'bpf_testmod_test_btf_type_tag_user_1' arg0 has btf_id 104036 type STRUCT 'bpf_testmod_btf_type_tag_1'
1: R1_w=user_ptr_bpf_testmod_btf_type_tag_1(id=0,ref_obj_id=0,off=0,imm=0)
; g = arg->a;
1: (61) r1 = *(u32 *)(r1 +0)
R1 is ptr_bpf_testmod_btf_type_tag_1 access user memory: off=0
Note the error message for the reason of rejection.
Fixes: c6f1bfe89ac9 ("bpf: reject program if a __user tagged memory accessed in kernel way")
Signed-off-by: Hao Luo <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Let's ensure that the PTR_TO_BTF_ID reg being passed in to release BPF
helpers and kfuncs always has its offset set to 0. While not a real
problem now, there's a very real possibility this will become a problem
when more and more kfuncs are exposed, and more BPF helpers are added
which can release PTR_TO_BTF_ID.
Previous commits already protected against non-zero var_off. One of the
case we are concerned about now is when we have a type that can be
returned by e.g. an acquire kfunc:
struct foo {
int a;
int b;
struct bar b;
};
... and struct bar is also a type that can be returned by another
acquire kfunc.
Then, doing the following sequence:
struct foo *f = bpf_get_foo(); // acquire kfunc
if (!f)
return 0;
bpf_put_bar(&f->b); // release kfunc
... would work with the current code, since the btf_struct_ids_match
takes reg->off into account for matching pointer type with release kfunc
argument type, but would obviously be incorrect, and most likely lead to
a kernel crash. A test has been included later to prevent regressions in
this area.
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|