cpufreq: schedutil: Call sugov_update_next_freq() before checking fast_switch_enabled
Note that sugov_update_next_freq() may return false, in which case the
caller, sugov_fast_switch(), will do nothing except the fast switch check.
Similarly, sugov_deferred_update() performs unnecessary
raw_spin_{lock,unlock} operations in sugov_update_single_freq() in that case.
So call sugov_update_next_freq() before the fast switch check to avoid
this unnecessary work. Accordingly, update the definition of
sugov_deferred_update() and remove sugov_fast_switch(), since
cpufreq_driver_fast_switch() will now be called directly instead.
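A minimal sketch of the reordered single-CPU update path, assuming the surrounding schedutil structures; names follow the functions mentioned above and the exact resulting code may differ:

  if (!sugov_update_next_freq(sg_policy, time, next_f))
          return;

  if (sg_policy->policy->fast_switch_enabled) {
          /* Fast switch is done in place, no locking needed. */
          cpufreq_driver_fast_switch(sg_policy->policy, next_f);
  } else {
          raw_spin_lock(&sg_policy->update_lock);
          sugov_deferred_update(sg_policy);
          raw_spin_unlock(&sg_policy->update_lock);
  }
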
Signed-off-by: Yue Hu <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
|
|
It is a common mistake for someone writing a trace event to save a pointer
to a string in the TP_fast_assign() and then display that string pointer
in the TP_printk() with %s. The problem is that those two events may happen
a long time apart, where the source of the string may no longer exist.
The proper way to handle displaying any string that is not guaranteed to be
in the kernel core rodata section is to copy it into the ring buffer via
the __string(), __assign_str() and __get_str() helper macros.
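For reference, a minimal sketch of that pattern (the event name and argument below are hypothetical):

  TRACE_EVENT(sample_open,
          TP_PROTO(const char *filename),
          TP_ARGS(filename),
          TP_STRUCT__entry(
                  __string(filename, filename)      /* reserves space in the ring buffer */
          ),
          TP_fast_assign(
                  __assign_str(filename, filename); /* copies the string at event time */
          ),
          /* %s now points into the ring buffer, not at the original source */
          TP_printk("filename=%s", __get_str(filename))
  );
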
Add a check at run time while displaying the TP_printk() of events to make
sure that every %s referenced is safe to dereference, and if it is not,
trigger a warning and only show the address of the pointer, and the
dereferenced string if it can be safely retrieved with a
strncpy_from_kernel_nofault() call.
In order to not have to copy the parsing of vsnprintf() formats, or even
exporting its code, the verifier relies on vsnprintf() being able to
modify the va_list that is passed to it, and it remains modified after it
is called. This is the case for some architectures like x86_64, but other
architectures like x86_32 pass the va_list to vsnprintf() as a value not a
reference, and the verifier can not use it to parse the non string
arguments. Thus, at boot up, it is checked if vsnprintf() modifies the
passed in va_list or not, and a static branch will disable the verifier if
it's not compatible.
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
Trace events record data into the ring buffer at the time of the event. The
trace event has printf logic to display the recorded data at a much later
time, when the user reads the trace file. This makes dereferencing pointers
unsafe if the dereferenced pointer points to the original source.
The safe way to handle this is to create an array within the trace event and
copy the source into the array. Then the dereferenced pointer may point to
that array.
As this is an easy mistake to make, a check is added to examine all trace
event print fmts to make sure that they are safe to use. This only checks
the various %p* dereferenced pointers like %pB, %pR, etc. It does not handle
dereferencing of strings, as there are some use cases where it is OK to
dereference the source. That will be dealt with differently.
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
Add a tracing_event_time_stamp() API that checks if the event passed in is
not on the ring buffer but a pointer to the per CPU trace_buffered_event
which does not have its time stamp set yet.
If it is a pointer to the trace_buffered_event, then just return the
current time stamp that the ring buffer would produce.
Otherwise, return the time stamp from the event.
Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: Tom Zanussi <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
The ring_buffer_event_time_stamp() must only be called for an event that has
not been committed yet and that is on the buffer passed in. This was
used to help debug converting the histogram logic over to using the new
time stamp code, and proved to be very useful.
Add a verifier that can check that this is the case, and extra WARN_ONs to
catch unexpected use cases.
Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: Tom Zanussi <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
Currently, the trace histograms rely on using absolute time stamps to
trigger the tracing not to use the temp buffer if filters are set. That's
because the histograms need the full timestamp that is saved in the ring
buffer. That is no longer the case, as ring_buffer_event_time_stamp()
can now return the time stamp for all events without triggering a full
absolute time stamp for each of them.
Now that the absolute time stamp is an unrelated dependency, there is
nothing about having absolute timestamps that prevents using the filter
buffer. Instead, change the interface to explicitly state to disable
filter buffering, which the histogram logic can use.
Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: Tom Zanussi <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
ring-buffer: Allow ring_buffer_event_time_stamp() to return time stamp of all events
Currently, ring_buffer_event_time_stamp() only returns an accurate time
stamp of the event if it has an absolute extended time stamp attached to
it. To make it more robust, use the event_stamp() in case the event does
not have an absolute value attached to it.
This will allow ring_buffer_event_time_stamp() to be used in more cases
than just histograms, and it will also allow histograms to not require
including absolute values all the time.
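A hedged usage sketch; only the (buffer, event) prototype follows the change described above, the consumer function is hypothetical:

  u64 ring_buffer_event_time_stamp(struct trace_buffer *buffer,
                                   struct ring_buffer_event *event);

  /* e.g. a histogram-style consumer computing a delta for an event
   * that has been reserved but not yet committed: */
  static u64 event_delta(struct trace_buffer *buffer,
                         struct ring_buffer_event *event, u64 start)
  {
          return ring_buffer_event_time_stamp(buffer, event) - start;
  }
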
Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: Tom Zanussi <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
The ring_buffer_event_time_stamp() is going to be updated to extract the
time stamp for the event without needing it to be set to have absolute
values for all events. But to do so, it needs the buffer that the event is
on as the buffer saves information for the event before it is committed to
the buffer.
If the trace buffer is disabled, a temporary buffer is used, and there's
no access to this buffer from the current histogram triggers, even though
it is passed to the trace event code.
Pass the buffer that the event is on all the way down to the histogram
triggers.
Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: Tom Zanussi <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
Add a place to save the current event time stamp for each level of nesting.
This will be used to retrieve the time stamp of the current event before it
is committed.
Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: Tom Zanussi <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
The exported use of ring_buffer_event_time_stamp() is going to become
different than how it is used internally. Move the internal logic out into a
static function called rb_event_time_stamp(), and have the internal callers
call that instead.
Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: Tom Zanussi <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
This reverts commit d60cd06331a3566d3305b3c7b566e79edf4e2095.
This patch causes a panic when rebooting my Dell Poweredge r440. I do
not have the full panic log as it's lost at that stage of the reboot and
I do not have a serial console. Reverting this patch makes my system
able to reboot again.
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
|
|
We want to change the current ndo_xdp_xmit drop semantics because it will
allow us to implement better queue overflow handling. This is working
towards the larger goal of an XDP TX queue-hook. Move XDP_REDIRECT error
path handling from each XDP ethernet driver to the devmap code. According to
the new API, the driver implementing the ndo_xdp_xmit pointer will break the
tx loop whenever the hw reports a tx error and will just return the number
of successfully transmitted frames to the devmap caller. It is then the
devmap's responsibility to free the dropped frames.
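A hedged sketch of the resulting driver-side pattern; the driver function and the hardware submit helper below are hypothetical:

  /* hypothetical hardware submit helper; returns nonzero on a tx error */
  static int example_hw_tx(struct net_device *dev, struct xdp_frame *frame);

  static int example_xdp_xmit(struct net_device *dev, int n,
                              struct xdp_frame **frames, u32 flags)
  {
          int i, nxmit = 0;

          for (i = 0; i < n; i++) {
                  if (example_hw_tx(dev, frames[i]))
                          break;          /* stop, do not free the frame */
                  nxmit++;
          }
          /* devmap frees frames[nxmit..n-1] on behalf of the driver */
          return nxmit;
  }
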
Move each XDP ndo_xdp_xmit capable driver to the new APIs:
- veth
- virtio-net
- mvneta
- mvpp2
- socionext
- amazon ena
- bnxt
- freescale (dpaa2, dpaa)
- xen-frontend
- qede
- ice
- igb
- ixgbe
- i40e
- mlx5
- ti (cpsw, cpsw-new)
- tun
- sfc
Signed-off-by: Lorenzo Bianconi <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Ioana Ciornei <[email protected]>
Reviewed-by: Ilias Apalodimas <[email protected]>
Reviewed-by: Camelia Groza <[email protected]>
Acked-by: Edward Cree <[email protected]>
Acked-by: Jesper Dangaard Brouer <[email protected]>
Acked-by: Shay Agroskin <[email protected]>
Link: https://lore.kernel.org/bpf/ed670de24f951cfd77590decf0229a0ad7fd12f6.1615201152.git.lorenzo@kernel.org
|
|
There is no need to keep the dentry pointer around for the created
debugfs file, as it is only needed when removing it from the system.
When it is to be removed, ask debugfs itself for the pointer, to save on
storage and make things a bit simpler.
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The fexit/fmod_ret programs can be attached to kernel functions that can sleep.
synchronize_rcu_tasks() will not wait for such tasks to complete. In such a
case the trampoline image will be freed, and when the task wakes up the
return IP will point to freed memory, causing a crash.
Solve this by adding percpu_ref_get/put for the duration of the trampoline
and by separating the trampoline's lifetime from that of its image.
The "half page" optimization has to be removed, since the
first_half->second_half->first_half transition cannot be guaranteed to
complete in deterministic time. Every trampoline update becomes a new image.
An image with fmod_ret or fexit progs will be freed via percpu_ref_kill and
call_rcu_tasks. Together they will wait for the original function and the
trampoline asm to complete. The trampoline is patched from nop to jmp to skip
fexit progs. They are freed independently from the trampoline. An image with
only fentry progs will be freed via call_rcu_tasks_trace+call_rcu_tasks, which
will wait for both sleepable and non-sleepable progs to complete.
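A hedged sketch of the image lifetime scheme, with illustrative structure and helper names rather than the exact ones from the patch:

  #include <linux/percpu-refcount.h>
  #include <linux/slab.h>

  struct tramp_image {
          void *image;
          struct percpu_ref pcref;        /* held while the image may be executing */
  };

  static void tramp_image_release(struct percpu_ref *pcref)
  {
          struct tramp_image *im = container_of(pcref, struct tramp_image, pcref);

          /* No task is executing inside the image anymore. */
          kfree(im->image);
          kfree(im);
  }

  static int tramp_image_init(struct tramp_image *im)
  {
          return percpu_ref_init(&im->pcref, tramp_image_release, 0, GFP_KERNEL);
  }

  /* The emitted asm brackets the call into the traced function: */
  static void tramp_enter(struct tramp_image *im) { percpu_ref_get(&im->pcref); }
  static void tramp_exit(struct tramp_image *im)  { percpu_ref_put(&im->pcref); }

  /* When a new image replaces this one, drop the initial reference: */
  static void tramp_image_retire(struct tramp_image *im) { percpu_ref_kill(&im->pcref); }
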
Fixes: fec56f5890d9 ("bpf: Introduce BPF trampoline")
Reported-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Paul E. McKenney <[email protected]> # for RCU
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Given we know the max possible value of ptr_limit at the time of retrieving
the latter, add basic assertions, so that the verifier can bail out if
anything looks odd and reject the program. Nothing triggered this so far,
but it also does not hurt to have these.
Signed-off-by: Piotr Krysiuk <[email protected]>
Co-developed-by: Daniel Borkmann <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
|
|
The if condition in irq_matrix_reserve() can be much simpler.
While at it fix a typo in the comment.
Signed-off-by: Juergen Gross <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Instead of having the mov32 with an aux->alu_limit - 1 immediate, move this
operation to retrieve_ptr_limit() to simplify the logic and to allow for
subsequent sanity boundary checks inside retrieve_ptr_limit(). This avoids
the risk that, at the time of the verifier masking rewrite, we would run
into an underflow which would not sign-extend due to the nature of the
mov32 instruction.
Signed-off-by: Piotr Krysiuk <[email protected]>
Co-developed-by: Daniel Borkmann <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
|
|
retrieve_ptr_limit() computes the ptr_limit for registers with stack and
map_value type. ptr_limit is the size of the memory area that is still
valid / in-bounds from the point of the current position and direction
of the operation (add / sub). This size will later be used for masking
the operation such that attempting out-of-bounds access in the speculative
domain is redirected to remain within the bounds of the current map value.
When masking to the right the size is correct; however, when masking to
the left, the size is off by one, which would lead to an incorrect mask
and thus an incorrect arithmetic operation in the non-speculative domain.
Piotr found that if the resulting alu_limit value is zero, then the
BPF_MOV32_IMM() from the fixup_bpf_calls() rewrite will end up loading
0xffffffff into AX instead of sign-extending to the full 64 bit range,
and as a result, this allows abuse for executing speculatively out-of-
bounds loads against a 4GB window of address space and thus extracting the
contents of kernel memory via a side channel.
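A small userspace illustration of why the zero limit is dangerous, assuming the mov32 semantics described above (low 32 bits written, upper bits zeroed rather than sign-extended):

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
          uint32_t alu_limit = 0;

          /* mov32-style: AX = 0x00000000ffffffff, a huge positive mask */
          uint64_t ax_mov32 = (uint32_t)(alu_limit - 1);
          /* sign-extending mov64-style: AX = 0xffffffffffffffff (-1) */
          uint64_t ax_mov64 = (uint64_t)(int64_t)(int32_t)(alu_limit - 1);

          printf("mov32: %#llx\nmov64: %#llx\n",
                 (unsigned long long)ax_mov32, (unsigned long long)ax_mov64);
          return 0;
  }
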
Fixes: 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic")
Signed-off-by: Piotr Krysiuk <[email protected]>
Co-developed-by: Daniel Borkmann <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
|
|
The purpose of this patch is to streamline error propagation and in particular
to propagate retrieve_ptr_limit() errors for pointer types that do not define
a ptr_limit, such that register-based alu ops against these types can be rejected.
The main rationale is that a gap has been identified by Piotr in the existing
protection against speculatively out-of-bounds loads; for example, in the case
of ctx pointers, unprivileged programs can still perform pointer arithmetic. This
can be abused to execute speculatively out-of-bounds loads without restrictions
and thus extract the contents of kernel memory.
Fix this by rejecting unprivileged programs that attempt any pointer arithmetic
on unprotected pointer types. The two affected ones are pointer to ctx as well
as pointer to map. Field access through a modified ctx pointer is rejected at a
later point in time in the verifier, and 7c6967326267 ("bpf: Permit map_ptr
arithmetic with opcode add and offset 0") is only relevant for root-only use
cases. The risk of unprivileged program breakage is considered very low.
Fixes: 7c6967326267 ("bpf: Permit map_ptr arithmetic with opcode add and offset 0")
Fixes: b2157399cc98 ("bpf: prevent out-of-bounds speculation")
Signed-off-by: Piotr Krysiuk <[email protected]>
Co-developed-by: Daniel Borkmann <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
|
|
On RT a task which has soft interrupts disabled can block on a lock and
schedule out to idle while soft interrupts are pending. This triggers the
warning in the NOHZ idle code which complains about going idle with pending
soft interrupts. But as the task is blocked, soft interrupt processing is
temporarily blocked as well, which means that such a warning is a false
positive.
To prevent that, check the per-CPU state which indicates that a
scheduled-out task has soft interrupts disabled.
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Sebastian Andrzej Siewior <[email protected]>
Tested-by: Paul E. McKenney <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Provide a local lock based serialization for soft interrupts on RT which
allows the local_bh_disabled() sections and servicing soft interrupts to be
preemptible.
Provide the necessary inline helpers which allow reuse of the bulk of the
softirq processing code.
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Sebastian Andrzej Siewior <[email protected]>
Tested-by: Paul E. McKenney <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
To allow reuse of the bulk of softirq processing code for RT and to avoid
#ifdeffery all over the place, split protections for various code sections
out into inline helpers so the RT variant can just replace them in one go.
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Sebastian Andrzej Siewior <[email protected]>
Tested-by: Paul E. McKenney <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
vtime_account_irq() and irqtime_account_irq() base their checks on
preempt_count(), which fails on RT because preempt_count() does not contain
the softirq accounting, which is separate on RT.
These checks do not need the full preempt count as they only operate on the
hardirq and softirq sections.
Use irq_count() instead, which provides the correct value on both RT and
non-RT kernels. The compiler is clever enough to fold the masking for !RT:
99b: 65 8b 05 00 00 00 00 mov %gs:0x0(%rip),%eax
- 9a2: 25 ff ff ff 7f and $0x7fffffff,%eax
+ 9a2: 25 00 ff ff 00 and $0xffff00,%eax
Reported-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Sebastian Andrzej Siewior <[email protected]>
Tested-by: Paul E. McKenney <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
tasklet_unlock_spin_wait() spin waits for the TASKLET_STATE_SCHED bit in
the tasklet state to be cleared. This works on !RT nicely because the
corresponding execution can only happen on a different CPU.
On RT softirq processing is preemptible, therefore a task preempting the
softirq processing thread can spin forever.
Prevent this by invoking local_bh_disable()/enable() inside the loop. In
case that the softirq processing thread was preempted by the current task,
current will block on the local lock which yields the CPU to the preempted
softirq processing thread. If the tasklet is processed on a different CPU
then the local_bh_disable()/enable() pair is just a waste of processor
cycles.
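A hedged sketch of the loop described above (not necessarily the exact code in the patch):

  void tasklet_unlock_spin_wait(struct tasklet_struct *t)
  {
          while (test_bit(TASKLET_STATE_SCHED, &t->state)) {
                  if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
                          /* Blocks on the BH local lock if the softirq thread
                           * was preempted by us, yielding the CPU to it;
                           * otherwise just a few wasted cycles. */
                          local_bh_disable();
                          local_bh_enable();
                  } else {
                          cpu_relax();
                  }
          }
  }
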
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Sebastian Andrzej Siewior <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
tasklet_kill() spin waits for TASKLET_STATE_SCHED to be cleared, invoking
yield() from inside the loop. yield() is an ill-defined mechanism and the
result might still be wasting CPU cycles in a tight loop, which is
especially painful in a guest when the CPU running the tasklet is scheduled
out.
tasklet_kill() is used in teardown paths and not performance critical at
all. Replace the spin wait with wait_var_event().
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
tasklet_unlock_wait() spin waits for TASKLET_STATE_RUN to be cleared. This
is wasting CPU cycles in a tight loop which is especially painful in a
guest when the CPU running the tasklet is scheduled out.
tasklet_unlock_wait() is invoked from tasklet_kill() which is used in
teardown paths and not performance critical at all. Replace the spin wait
with wait_var_event().
There are no users of tasklet_unlock_wait() which are invoked from atomic
contexts. The usage in tasklet_disable() has been replaced temporarily with
the spin waiting variant until the atomic users are fixed up and will be
converted to the sleep wait variant later.
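A hedged sketch of the sleep-wait pattern, assuming the standard wait_var_event()/wake_up_var() helpers; the exact functions touched by the patch may differ:

  void tasklet_unlock(struct tasklet_struct *t)
  {
          smp_mb__before_atomic();
          clear_bit(TASKLET_STATE_RUN, &t->state);
          smp_mb__after_atomic();
          wake_up_var(&t->state);         /* wake any tasklet_unlock_wait() sleeper */
  }

  void tasklet_unlock_wait(struct tasklet_struct *t)
  {
          wait_var_event(&t->state, !test_bit(TASKLET_STATE_RUN, &t->state));
  }
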
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
For userspace checkpoint and restore (C/R), a way of getting process state
containing the RSEQ configuration is needed.
There are two ways this information is going to be used:
- to re-enable RSEQ for threads which had it enabled before C/R
- to detect if a thread was in a critical section during C/R
Since C/R preserves TLS memory and addresses, the RSEQ ABI will be restored
using the address registered before C/R.
Detecting whether the thread is in a critical section during C/R is needed
to enforce the behavior of RSEQ abort during C/R. Attaching with ptrace()
before registers are dumped does not itself cause an RSEQ abort.
Restoring the instruction pointer within the critical section is
problematic because rseq_cs may get cleared before control is passed
to the migrated application code, leading to RSEQ invariants not being
preserved. The C/R code will use the RSEQ ABI address to find the abort
handler to which the instruction pointer needs to be set.
To achieve the above goals, expose the RSEQ ABI address and the signature
value with the new ptrace request PTRACE_GET_RSEQ_CONFIGURATION.
This new ptrace request can also be used by debuggers so they are aware
of stops within restartable sequences in progress.
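A hedged userspace sketch of querying a stopped tracee; the request number and structure layout below are assumptions based on the description above and may not match the final ABI exactly:

  #include <stdint.h>
  #include <stdio.h>
  #include <sys/ptrace.h>
  #include <sys/types.h>

  #ifndef PTRACE_GET_RSEQ_CONFIGURATION
  #define PTRACE_GET_RSEQ_CONFIGURATION 0x420f    /* assumed request value */
  #endif

  struct rseq_configuration {                     /* assumed layout */
          uint64_t rseq_abi_pointer;
          uint32_t rseq_abi_size;
          uint32_t signature;
          uint32_t flags;
          uint32_t pad;
  };

  static int dump_rseq(pid_t pid)
  {
          struct rseq_configuration conf;

          if (ptrace(PTRACE_GET_RSEQ_CONFIGURATION, pid,
                     sizeof(conf), &conf) < 0)
                  return -1;

          printf("rseq abi @ %#llx, signature %#x\n",
                 (unsigned long long)conf.rseq_abi_pointer, conf.signature);
          return 0;
  }
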
Signed-off-by: Piotr Figiel <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Michal Miroslaw <[email protected]>
Reviewed-by: Mathieu Desnoyers <[email protected]>
Acked-by: Oleg Nesterov <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Wire up the quotactl_path syscall added in the previous patch.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sascha Hauer <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
|
|
Replace BUG() with WARN_ONCE() on wrong tasklet state, in order to:
- increase the verbosity / aid in debugging
- avoid fatal/unrecoverable state
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Dirk Behme <[email protected]>
Signed-off-by: Eugeniu Rosca <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The use_ww_ctx flag is passed to mutex_optimistic_spin(), but the
function doesn't use it. The frequent use of the (use_ww_ctx && ww_ctx)
combination is repetitive.
In fact, ww_ctx should not be used at all if !use_ww_ctx. Simplify the
ww_mutex code by dropping use_ww_ctx from mutex_optimistic_spin() and
clearing ww_ctx if !use_ww_ctx. In this way, we can replace (use_ww_ctx &&
ww_ctx) with just (ww_ctx).
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Davidlohr Bueso <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
s/folowing/following/
Signed-off-by: Bhaskar Chowdhury <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Randy Dunlap <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Lift the double initialization protection from xen-swiotlb to the core
code to avoid exposing too many swiotlb internals. Also upgrade the
check to a warning as it should not happen.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
|
|
Split swiotlb_tbl_sync_single into two separate functions for the to-device
and to-cpu synchronization.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
|
|
Move the code to find and validate the original buffer address and size
from the callers into swiotlb_bounce. This means a tiny bit of extra
work in the swiotlb_map path, but avoids code duplication and leads to
a better code structure.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
|
|
Now that swiotlb remembers the allocation size there is no need to pass
it back to swiotlb_tbl_unmap_single.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
|
|
The following sequence of commands:
register_ftrace_direct(ip, addr1);
modify_ftrace_direct(ip, addr1, addr2);
unregister_ftrace_direct(ip, addr2);
will cause the kernel to warn:
[ 30.179191] WARNING: CPU: 2 PID: 1961 at kernel/trace/ftrace.c:5223 unregister_ftrace_direct+0x130/0x150
[ 30.180556] CPU: 2 PID: 1961 Comm: test_progs W O 5.12.0-rc2-00378-g86bc10a0a711-dirty #3246
[ 30.182453] RIP: 0010:unregister_ftrace_direct+0x130/0x150
When modify_ftrace_direct() changes the addr from old to new it should update
the addr stored in ftrace_direct_funcs. Otherwise the final
unregister_ftrace_direct() won't find the address and will cause the splat.
Fixes: 0567d6809182 ("ftrace: Add modify_ftrace_direct()")
Signed-off-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Steven Rostedt (VMware) <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Preparation for fixing get_nr_restart_syscall() on X86 for COMPAT.
Add a new helper which sets restart_block->fn and calls a dummy
arch_set_restart_data() helper.
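A hedged sketch of such a helper (the -ERESTART_RESTARTBLOCK return convention is an assumption):

  static inline long set_restart_fn(struct restart_block *restart,
                                    long (*fn)(struct restart_block *))
  {
          restart->fn = fn;
          arch_set_restart_data(restart);  /* no-op unless the arch overrides it */
          return -ERESTART_RESTARTBLOCK;
  }
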
Fixes: 609c19a385c8 ("x86/ptrace: Stop setting TS_COMPAT in ptrace code")
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
|
|
Currently, the lockdown state is queried unconditionally, even though
its result is used only if the PERF_SAMPLE_REGS_INTR bit is set in
attr.sample_type. While that doesn't matter in case of the Lockdown LSM,
it causes trouble with the SELinux's lockdown hook implementation.
SELinux implements the locked_down hook with a check whether the current
task's type has the corresponding "lockdown" class permission
("integrity" or "confidentiality") allowed in the policy. This means
that calling the hook when the access control decision would be ignored
generates a bogus permission check and audit record.
Fix this by checking sample_type first and only calling the hook when
its result would be honored.
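A hedged sketch of what that ordering looks like in the attr validation path; the helper below is illustrative rather than the exact code from the patch:

  static int perf_check_lockdown(const struct perf_event_attr *attr)
  {
          if (!(attr->sample_type & PERF_SAMPLE_REGS_INTR))
                  return 0;
          /* The LSM decision is only consulted when it will be honored. */
          return security_locked_down(LOCKDOWN_PERF);
  }
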
Fixes: b0c8fdc7fdb7 ("lockdown: Lock down perf when in confidentiality mode")
Signed-off-by: Ondrej Mosnacek <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Paul Moore <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
For cpu events, it'd be better to allocate them in the corresponding node's
memory as they would be mostly accessed by the target cpu. Although the
perf tools set the cpu affinity before calling perf_event_open, there
are places that don't (notably perf record), and we should consider
other external users too.
Signed-off-by: Namhyung Kim <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
The kernel can allocate a lot of struct perf_event instances when profiling.
For example, 256 cpus x 8 events x 20 cgroups = 40K instances of the struct
would be allocated on a large system.
The size of struct perf_event in my setup is 1152 bytes. As it's
allocated by kmalloc, the actual allocation size is rounded up to 2K.
That means 896 bytes (~43%) of waste per instance, resulting in a total of
~35MB with 40K instances. We can create a dedicated kmem_cache to
avoid such a big unnecessary memory consumption.
With this change, I can see the following (note this machine has 112 cpus).
# grep perf_event /proc/slabinfo
perf_event 224 784 1152 7 2 : tunables 24 12 8 : slabdata 112 112 0
The sixth column is pages-per-slab, which is 2, and the fifth column is
obj-per-slab, which is 7. Thus it can actually use 1152 x 7 = 8064
bytes of the 8K, and the wasted memory is (8192 - 8064) / 7 = ~18 bytes per
instance.
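A hedged sketch of the dedicated cache approach; the cache name and call sites below are illustrative:

  static struct kmem_cache *perf_event_cache;

  void __init perf_event_cache_init(void)
  {
          /* sized exactly for struct perf_event instead of a 2K kmalloc bucket */
          perf_event_cache = KMEM_CACHE(perf_event, SLAB_PANIC);
  }

  static struct perf_event *perf_event_alloc_node(int node)
  {
          return kmem_cache_alloc_node(perf_event_cache,
                                       GFP_KERNEL | __GFP_ZERO, node);
  }

  static void perf_event_free(struct perf_event *event)
  {
          kmem_cache_free(perf_event_cache, event);
  }
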
Signed-off-by: Namhyung Kim <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
I found the ring buffer pages are allocated in the node but the ring
buffer itself is not. Let's convert it to use kzalloc_node() too.
Signed-off-by: Namhyung Kim <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
The sparse tool complains as follows:
kernel/bpf/bpf_task_storage.c:23:1: warning:
symbol '__pcpu_scope_bpf_task_storage_busy' was not declared. Should it be static?
This symbol is not used outside of bpf_task_storage.c, so this
commit marks it static.
Fixes: bc235cdb423a ("bpf: Prevent deadlock from recursive bpf_task_storage_[get|delete]")
Reported-by: Hulk Robot <[email protected]>
Signed-off-by: Wei Yongjun <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
A typo was found by the codespell tool on line 34 of hashtab.c:
$ codespell ./kernel/bpf/
./hashtab.c:34 : differrent ==> different
Fix the typo found by codespell.
Signed-off-by: Liu xuzhi <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
fanotify has some hardcoded limits. The only APIs to escape those limits
are FAN_UNLIMITED_QUEUE and FAN_UNLIMITED_MARKS.
Allow finer grained tuning of the system limits via sysfs tunables under
/proc/sys/fs/fanotify, similar to tunables under /proc/sys/fs/inotify,
with some minor differences.
- max_queued_events - global system tunable for group queue size limit.
Like the inotify tunable with the same name, it defaults to 16384 and
applies on initialization of a new group.
- max_user_marks - user ns tunable for marks limit per user.
Like the inotify tunable named max_user_watches, it defaults to 1048576
in init userns on a machine with sufficient RAM, and can be further
limited per containing user ns.
- max_user_groups - user ns tunable for number of groups per user.
Like the inotify tunable named max_user_instances, it defaults to 128
in init userns and can be further limited per containing user ns.
The slightly different tunable names used for fanotify are derived from
the "group" and "mark" terminology used in the fanotify man pages and
throughout the code.
Considering the fact that the default value for max_user_watches was
increased in kernel v5.10 from 8192 to 1048576, leaving the legacy
fanotify limit of 8192 marks per group in addition to the max_user_marks
limit makes little sense, so the per group marks limit has been removed.
Note that when a group is initialized with FAN_UNLIMITED_MARKS, its own
marks are not accounted in the per user marks account, so in effect the
limit of max_user_marks is only for the collection of groups that are
not initialized with FAN_UNLIMITED_MARKS.
Link: https://lore.kernel.org/r/[email protected]
Suggested-by: Jan Kara <[email protected]>
Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
|
|
Fix typos in kernel doc, otherwise validation script complains:
.../irq_sim.c:170: warning: Function parameter or member 'fwnode' not described in 'irq_domain_create_sim'
.../irq_sim.c:170: warning: Excess function parameter 'fnode' description in 'irq_domain_create_sim'
.../irq_sim.c:240: warning: Function parameter or member 'fwnode' not described in 'devm_irq_domain_create_sim'
.../irq_sim.c:240: warning: Excess function parameter 'fnode' description in 'devm_irq_domain_create_sim'
Signed-off-by: Andy Shevchenko <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
No functional change.
Signed-off-by: Krzysztof Kozlowski <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Ever since RCU was converted to softirq, it has no users.
Signed-off-by: Davidlohr Bueso <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Paul E. McKenney <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Those tracing calls don't need to be under ->nocb_lock. This commit
therefore moves them outside of that lock.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Neeraj Upadhyay <[email protected]>
Cc: Boqun Feng <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
This commit removes a stale comment claiming that the cblist must be
empty before changing the offloading state. This claim was correct back
when the offloaded state was defined exclusively at boot.
Reported-by: Paul E. McKenney <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Neeraj Upadhyay <[email protected]>
Cc: Boqun Feng <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
Currently, the bypass is flushed at the very last moment in the
deoffloading procedure. However, this approach leads to a larger state
space than would be preferred. This commit therefore disables the
bypass as soon as the deoffloading procedure begins, then flushes it.
This guarantees that the bypass remains empty and thus out of the way
of the deoffloading procedure.
Symmetrically, this commit waits to enable the bypass until the offloading
procedure has completed.
Reported-by: Paul E. McKenney <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Neeraj Upadhyay <[email protected]>
Cc: Boqun Feng <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|