path: root/kernel

2021-03-18  cpufreq: schedutil: Call sugov_update_next_freq() before check to fast_switch_enabled  (Yue Hu; 1 file, -17/+12)

Note that sugov_update_next_freq() may return false, in which case the caller sugov_fast_switch() does nothing except the fast switch check. Similarly, sugov_deferred_update() performs unnecessary raw_spin_{lock,unlock} operations in sugov_update_single_freq() for that case. So, call sugov_update_next_freq() before the fast switch check to avoid this unnecessary work. Accordingly, update the definition of sugov_deferred_update() and remove sugov_fast_switch(), calling cpufreq_driver_fast_switch() directly instead.

Signed-off-by: Yue Hu <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>

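A minimal sketch of the reordered flow in sugov_update_single_freq() as described above (not the literal patch; surrounding context elided):

  static void sugov_update_single_freq(struct update_util_data *hook,
                                       u64 time, unsigned int flags)
  {
          struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
          struct sugov_policy *sg_policy = sg_cpu->sg_policy;
          unsigned int next_f;

          /* ... compute next_f ... */

          /* Bail out first: if the frequency does not change, neither the
           * fast switch check nor the update_lock round trip is needed. */
          if (!sugov_update_next_freq(sg_policy, time, next_f))
                  return;

          if (sg_policy->policy->fast_switch_enabled) {
                  cpufreq_driver_fast_switch(sg_policy->policy, next_f);
          } else {
                  raw_spin_lock(&sg_policy->update_lock);
                  sugov_deferred_update(sg_policy);
                  raw_spin_unlock(&sg_policy->update_lock);
          }
  }
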
2021-03-18  tracing: Add a verifier to check string pointers for trace events  (Steven Rostedt (VMware); 4 files, -1/+215)

It is a common mistake for someone writing a trace event to save a pointer to a string in the TP_fast_assign() and then display that string pointer in the TP_printk() with %s. The problem is that those two events may happen a long time apart, where the source of the string may no longer exist. The proper way to handle displaying any string that is not guaranteed to be in the kernel core rodata section is to copy it into the ring buffer via the __string(), __assign_str() and __get_str() helper macros.

Add a run-time check while displaying the TP_printk() of events to make sure that every %s referenced is safe to dereference, and if it is not, trigger a warning and only show the address of the pointer, plus the dereferenced string if it can be safely retrieved with a strncpy_from_kernel_nofault() call.

In order to not have to copy the parsing of vsnprintf() formats, or even export its code, the verifier relies on vsnprintf() being able to modify the va_list that is passed to it, such that it remains modified after the call. This is the case for some architectures like x86_64, but other architectures like x86_32 pass the va_list to vsnprintf() as a value, not a reference, and the verifier cannot use it to parse the non-string arguments. Thus, at boot up, it is checked whether vsnprintf() modifies the passed-in va_list or not, and a static branch will disable the verifier if it's not compatible.

Signed-off-by: Steven Rostedt (VMware) <[email protected]>

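The boot-time probe idea can be illustrated with a small userspace sketch (the kernel's actual test differs in detail): hand a va_list to vsnprintf() and then check whether a subsequent va_arg() sees the next argument or the first one again.

  #include <stdarg.h>
  #include <stdbool.h>
  #include <stdio.h>

  /* Call as probe("%d", 1, 2). vsnprintf() consumes the first int; if the
   * va_list was advanced in place (e.g. x86_64 passes it by reference),
   * the following va_arg() yields 2. If it was passed by value
   * (e.g. x86_32), the caller's va_list is untouched and it yields 1. */
  static bool probe(const char *fmt, ...)
  {
          char buf[16];
          va_list ap;
          int next;

          va_start(ap, fmt);
          vsnprintf(buf, sizeof(buf), fmt, ap);
          next = va_arg(ap, int);
          va_end(ap);
          return next == 2;     /* true: vsnprintf() modifies the va_list */
  }
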
2021-03-18  tracing: Add check of trace event print fmts for dereferencing pointers  (Steven Rostedt (VMware); 1 file, -0/+210)

Trace events record data into the ring buffer at the time of the event. The trace event has printf logic to display the recorded data at a much later time, when the user reads the trace file. This makes dereferencing pointers unsafe if the dereferenced pointer points to the original source. The safe way to handle this is to create an array within the trace event and copy the source into the array. Then the dereferenced pointer may point to that array.

As this is an easy mistake to make, a check is added to examine all trace event print fmts to make sure that they are safe to use. This only checks the various %p* dereferenced pointers like %pB, %pR, etc. It does not handle dereferencing of strings, as there are some use cases where it is OK to dereference the source. That will be dealt with differently.

Signed-off-by: Steven Rostedt (VMware) <[email protected]>

2021-03-18  tracing: Add tracing_event_time_stamp() API  (Steven Rostedt (VMware); 2 files, -0/+9)

Add a tracing_event_time_stamp() API that checks if the event passed in is not on the ring buffer but a pointer to the per CPU trace_buffered_event which does not have its time stamp set yet. If it is a pointer to the trace_buffered_event, then just return the current time stamp that the ring buffer would produce. Otherwise, return the time stamp from the event.

Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: Tom Zanussi <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>

2021-03-18  ring-buffer: Add verifier for using ring_buffer_event_time_stamp()  (Steven Rostedt (VMware); 1 file, -4/+52)

ring_buffer_event_time_stamp() must only be called for an event that has not been committed yet, and that is on the buffer that is passed in. This was used to help debug converting the histogram logic over to using the new time stamp code, and was proven to be very useful.

Add a verifier that can check that this is the case, and extra WARN_ONs to catch unexpected use cases.

Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: Tom Zanussi <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>

2021-03-18  tracing: Use a no_filter_buffering_ref to stop using the filter buffer  (Steven Rostedt (VMware); 3 files, -21/+17)

Currently, the trace histograms rely on absolute time stamps to make the tracing code avoid the temp buffer when filters are set. That's because the histograms need the full timestamp that is saved in the ring buffer. That is no longer the case, as ring_buffer_event_time_stamp() can now return the time stamp for all events without requiring a full absolute time stamp on each one.

The absolute time stamp is now an unrelated dependency: there's nothing about having absolute timestamps that should keep the filter buffer from being used. Instead, change the interface to explicitly state to disable filter buffering, which the histogram logic can use.

Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: Tom Zanussi <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>

2021-03-18  ring-buffer: Allow ring_buffer_event_time_stamp() to return time stamp of all events  (Steven Rostedt (VMware); 2 files, -16/+46)

Currently, ring_buffer_event_time_stamp() only returns an accurate time stamp of the event if it has an absolute extended time stamp attached to it. To make it more robust, use the event_stamp() in case the event does not have an absolute value attached to it. This will allow ring_buffer_event_time_stamp() to be used in more cases than just histograms, and it will also allow histograms to not require including absolute values all the time.

Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: Tom Zanussi <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>

2021-03-18  tracing: Pass buffer of event to trigger operations  (Steven Rostedt (VMware); 4 files, -51/+92)

The ring_buffer_event_time_stamp() is going to be updated to extract the time stamp for the event without needing it to be set to have absolute values for all events. But to do so, it needs the buffer that the event is on, as the buffer saves information for the event before it is committed to the buffer.

If the trace buffer is disabled, a temporary buffer is used, and there's no access to this buffer from the current histogram triggers, even though it is passed to the trace event code. Pass the buffer that the event is on all the way down to the histogram triggers.

Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: Tom Zanussi <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>

2021-03-18  ring-buffer: Add an event_stamp to cpu_buffer for each level of nesting  (Steven Rostedt (VMware); 1 file, -1/+10)

Add a place to save the current event time stamp for each level of nesting. This will be used to retrieve the time stamp of the current event before it is committed.

Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: Tom Zanussi <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>

2021-03-18  ring-buffer: Separate out internal use of ring_buffer_event_time_stamp()  (Steven Rostedt (VMware); 1 file, -18/+23)

The exported use of ring_buffer_event_time_stamp() is going to become different than how it is used internally. Move the internal logic out into a static function called rb_event_time_stamp(), and have the internal callers call that instead.

Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: Tom Zanussi <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>

2021-03-18  Revert "PM: ACPI: reboot: Use S5 for reboot"  (Josef Bacik; 1 file, -2/+0)

This reverts commit d60cd06331a3566d3305b3c7b566e79edf4e2095.

This patch causes a panic when rebooting my Dell Poweredge r440. I do not have the full panic log as it's lost at that stage of the reboot and I do not have a serial console. Reverting this patch makes my system able to reboot again.

Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>

2021-03-18  bpf, devmap: Move drop error path to devmap for XDP_REDIRECT  (Lorenzo Bianconi; 1 file, -18/+12)

We want to change the current ndo_xdp_xmit drop semantics because it will allow us to implement better queue overflow handling. This is working towards the larger goal of an XDP TX queue-hook.

Move XDP_REDIRECT error path handling from each XDP Ethernet driver to the devmap code. According to the new APIs, the driver implementing the ndo_xdp_xmit pointer will break the tx loop whenever the hw reports a tx error, and it will just return to the devmap caller the number of successfully transmitted frames. It is the devmap's responsibility to free dropped frames (the driver-side contract is sketched below).

Move each XDP ndo_xdp_xmit capable driver to the new APIs:
- veth
- virtio-net
- mvneta
- mvpp2
- socionext
- amazon ena
- bnxt
- freescale (dpaa2, dpaa)
- xen-frontend
- qede
- ice
- igb
- ixgbe
- i40e
- mlx5
- ti (cpsw, cpsw-new)
- tun
- sfc

Signed-off-by: Lorenzo Bianconi <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Ioana Ciornei <[email protected]>
Reviewed-by: Ilias Apalodimas <[email protected]>
Reviewed-by: Camelia Groza <[email protected]>
Acked-by: Edward Cree <[email protected]>
Acked-by: Jesper Dangaard Brouer <[email protected]>
Acked-by: Shay Agroskin <[email protected]>
Link: https://lore.kernel.org/bpf/ed670de24f951cfd77590decf0229a0ad7fd12f6.1615201152.git.lorenzo@kernel.org

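The driver-side contract described above might look like this (a sketch only; example_hw_tx() is a hypothetical hardware TX helper):

  /* New ndo_xdp_xmit semantics: stop on the first hw error and return
   * the number of successfully transmitted frames; the devmap caller
   * frees frames[sent..n-1]. */
  static int example_xdp_xmit(struct net_device *dev, int n,
                              struct xdp_frame **frames, u32 flags)
  {
          int i, sent = 0;

          for (i = 0; i < n; i++) {
                  if (example_hw_tx(dev, frames[i]))
                          break;          /* do not free, just stop */
                  sent++;
          }
          return sent;
  }
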
2021-03-18  time/debug: Remove dentry pointer for debugfs  (Greg Kroah-Hartman; 1 file, -4/+3)

There is no need to keep the dentry pointer around for the created debugfs file, as it is only needed when removing it from the system. When it is to be removed, ask debugfs itself for the pointer, to save on storage and make things a bit simpler.

Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

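The resulting pattern, sketched with an illustrative file name and fops:

  /* creation: the returned dentry no longer needs to be stored */
  debugfs_create_file("udelay_test", 0644, NULL, NULL, &udelay_test_debugfs_ops);

  /* removal: ask debugfs for the pointer only when it is needed */
  debugfs_remove(debugfs_lookup("udelay_test", NULL));
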
2021-03-18  bpf: Fix fexit trampoline.  (Alexei Starovoitov; 3 files, -53/+171)

The fexit/fmod_ret programs can be attached to kernel functions that can sleep. The synchronize_rcu_tasks() will not wait for such tasks to complete. In such a case the trampoline image will be freed, and when the task wakes up the return IP will point to freed memory, causing a crash.

Solve this by adding percpu_ref_get/put for the duration of the trampoline, and separate the trampoline's lifetime from its image's. The "half page" optimization has to be removed, since the first_half->second_half->first_half transition cannot be guaranteed to complete in deterministic time. Every trampoline update becomes a new image. An image with fmod_ret or fexit progs will be freed via percpu_ref_kill and call_rcu_tasks; together they will wait for the original function and trampoline asm to complete. The trampoline is patched from nop to jmp to skip fexit progs; they are freed independently from the trampoline. An image with fentry progs only will be freed via call_rcu_tasks_trace+call_rcu_tasks, which will wait for both sleepable and non-sleepable progs to complete.

Fixes: fec56f5890d9 ("bpf: Introduce BPF trampoline")
Reported-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Paul E. McKenney <[email protected]> # for RCU
Link: https://lore.kernel.org/bpf/[email protected]

2021-03-17  bpf: Add sanity check for upper ptr_limit  (Piotr Krysiuk; 1 file, -3/+8)

Given we know the max possible value of ptr_limit at the time of retrieving the latter, add basic assertions, so that the verifier can bail out if anything looks odd and reject the program. Nothing triggered this so far, but it also does not hurt to have these.

Signed-off-by: Piotr Krysiuk <[email protected]>
Co-developed-by: Daniel Borkmann <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>

2021-03-17  irq: Simplify condition in irq_matrix_reserve()  (Juergen Gross; 1 file, -3/+2)

The if condition in irq_matrix_reserve() can be much simpler. While at it, fix a typo in the comment.

Signed-off-by: Juergen Gross <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

2021-03-17  bpf: Simplify alu_limit masking for pointer arithmetic  (Piotr Krysiuk; 1 file, -5/+5)

Instead of having the mov32 with an aux->alu_limit - 1 immediate, move this operation into retrieve_ptr_limit() to simplify the logic and to allow for subsequent sanity boundary checks inside retrieve_ptr_limit(). This avoids a future situation where, at the time of the verifier masking rewrite, we'd run into an underflow which would not sign extend due to the nature of the mov32 instruction.

Signed-off-by: Piotr Krysiuk <[email protected]>
Co-developed-by: Daniel Borkmann <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>

2021-03-17  bpf: Fix off-by-one for area size in creating mask to left  (Piotr Krysiuk; 1 file, -2/+2)

retrieve_ptr_limit() computes the ptr_limit for registers with stack and map_value type. ptr_limit is the size of the memory area that is still valid / in-bounds from the point of the current position and direction of the operation (add / sub). This size will later be used for masking the operation such that attempting out-of-bounds access in the speculative domain is redirected to remain within the bounds of the current map value.

When masking to the right the size is correct; however, when masking to the left, the size is off by one, which would lead to an incorrect mask and thus an incorrect arithmetic operation in the non-speculative domain. Piotr found that if the resulting alu_limit value is zero, then the BPF_MOV32_IMM() from the fixup_bpf_calls() rewrite will end up loading 0xffffffff into AX instead of sign-extending to the full 64 bit range, and as a result, this allows abuse for executing speculatively out-of-bounds loads against a 4GB window of address space and thus extracting the contents of kernel memory via side-channel.

Fixes: 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic")
Signed-off-by: Piotr Krysiuk <[email protected]>
Co-developed-by: Daniel Borkmann <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>

2021-03-17  bpf: Prohibit alu ops for pointer types not defining ptr_limit  (Piotr Krysiuk; 1 file, -6/+10)

The purpose of this patch is to streamline error propagation and in particular to propagate retrieve_ptr_limit() errors for pointer types that do not define a ptr_limit, such that register-based alu ops against these types can be rejected.

The main rationale is that a gap has been identified by Piotr in the existing protection against speculatively out-of-bounds loads: for example, in the case of ctx pointers, unprivileged programs can still perform pointer arithmetic. This can be abused to execute speculatively out-of-bounds loads without restrictions and thus extract contents of kernel memory.

Fix this by rejecting unprivileged programs that attempt any pointer arithmetic on unprotected pointer types. The two affected ones are pointer to ctx as well as pointer to map. Field access to a modified ctx pointer is rejected at a later point in time in the verifier, and 7c6967326267 ("bpf: Permit map_ptr arithmetic with opcode add and offset 0") is only relevant for root-only use cases. The risk of unprivileged program breakage is considered very low.

Fixes: 7c6967326267 ("bpf: Permit map_ptr arithmetic with opcode add and offset 0")
Fixes: b2157399cc98 ("bpf: prevent out-of-bounds speculation")
Signed-off-by: Piotr Krysiuk <[email protected]>
Co-developed-by: Daniel Borkmann <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>

2021-03-17  tick/sched: Prevent false positive softirq pending warnings on RT  (Thomas Gleixner; 2 files, -1/+16)

On RT, a task which has soft interrupts disabled can block on a lock and schedule out to idle while soft interrupts are pending. This triggers the warning in the NOHZ idle code which complains about going idle with pending soft interrupts. But as the task is blocked, soft interrupt processing is temporarily blocked as well, which means that such a warning is a false positive.

To prevent that, check the per-CPU state which indicates that a scheduled-out task has soft interrupts disabled.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Sebastian Andrzej Siewior <[email protected]>
Tested-by: Paul E. McKenney <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

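A sketch of the adjusted warning condition in the NOHZ idle path, assuming a local_bh_blocked() helper that reports that per-CPU state (the message text is illustrative):

  /* Only warn when softirqs are genuinely stuck pending, not merely
   * blocked because a softirq-disabled task scheduled out (RT). */
  if (ratelimit < 10 && !local_bh_blocked() &&
      (local_softirq_pending() & SOFTIRQ_STOP_IDLE_MASK)) {
          pr_warn("NOHZ tick-stop error: local softirq work is pending, handler #%02x!!!\n",
                  (unsigned int)local_softirq_pending());
          ratelimit++;
  }
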
2021-03-17  softirq: Make softirq control and processing RT aware  (Thomas Gleixner; 1 file, -7/+181)

Provide a local lock based serialization for soft interrupts on RT which allows the local_bh_disabled() sections and servicing of soft interrupts to be preemptible. Provide the necessary inline helpers which allow reuse of the bulk of the softirq processing code.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Sebastian Andrzej Siewior <[email protected]>
Tested-by: Paul E. McKenney <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

2021-03-17  softirq: Move various protections into inline helpers  (Thomas Gleixner; 1 file, -7/+32)

To allow reuse of the bulk of softirq processing code for RT and to avoid #ifdeffery all over the place, split protections for various code sections out into inline helpers so the RT variant can just replace them in one go.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Sebastian Andrzej Siewior <[email protected]>
Tested-by: Paul E. McKenney <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

2021-03-17  irqtime: Make accounting correct on RT  (Thomas Gleixner; 1 file, -2/+2)

vtime_account_irq() and irqtime_account_irq() base their checks on preempt_count(), which fails on RT because preempt_count() does not contain the softirq accounting, which is separate on RT.

These checks do not need the full preempt count as they only operate on the hard and soft interrupt sections. Use irq_count() instead, which provides the correct value on both RT and non-RT kernels. The compiler is clever enough to fold the masking for !RT:

    99b:  65 8b 05 00 00 00 00  mov %gs:0x0(%rip),%eax
  - 9a2:  25 ff ff ff 7f        and $0x7fffffff,%eax
  + 9a2:  25 00 ff ff 00        and $0xffff00,%eax

Reported-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Sebastian Andrzej Siewior <[email protected]>
Tested-by: Paul E. McKenney <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

2021-03-17  tasklets: Prevent tasklet_unlock_spin_wait() deadlock on RT  (Thomas Gleixner; 1 file, -1/+27)

tasklet_unlock_spin_wait() spin waits for the TASKLET_STATE_SCHED bit in the tasklet state to be cleared. This works on !RT nicely because the corresponding execution can only happen on a different CPU. On RT softirq processing is preemptible, therefore a task preempting the softirq processing thread can spin forever.

Prevent this by invoking local_bh_disable()/enable() inside the loop. In case that the softirq processing thread was preempted by the current task, current will block on the local lock, which yields the CPU to the preempted softirq processing thread. If the tasklet is processed on a different CPU then the local_bh_disable()/enable() pair is just a waste of processor cycles.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Sebastian Andrzej Siewior <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

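A sketch of the RT-safe wait loop just described (not the literal patch):

  void tasklet_unlock_spin_wait(struct tasklet_struct *t)
  {
          while (test_bit(TASKLET_STATE_SCHED, &t->state)) {
                  /* On RT this blocks on the softirq local lock if the
                   * softirq thread was preempted by us, yielding the CPU
                   * to it; otherwise the pair just burns a few cycles. */
                  local_bh_disable();
                  local_bh_enable();
          }
  }
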
2021-03-17  tasklets: Replace spin wait in tasklet_kill()  (Peter Zijlstra; 1 file, -9/+9)

tasklet_kill() spin waits for TASKLET_STATE_SCHED to be cleared, invoking yield() from inside the loop. yield() is an ill defined mechanism and the result might still be wasting CPU cycles in a tight loop, which is especially painful in a guest when the CPU running the tasklet is scheduled out.

tasklet_kill() is used in teardown paths and not performance critical at all. Replace the spin wait with wait_var_event().

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

2021-03-17  tasklets: Replace spin wait in tasklet_unlock_wait()  (Peter Zijlstra; 1 file, -0/+18)

tasklet_unlock_wait() spin waits for TASKLET_STATE_RUN to be cleared. This is wasting CPU cycles in a tight loop, which is especially painful in a guest when the CPU running the tasklet is scheduled out.

tasklet_unlock_wait() is invoked from tasklet_kill(), which is used in teardown paths and not performance critical at all. Replace the spin wait with wait_var_event().

There are no users of tasklet_unlock_wait() which are invoked from atomic contexts. The usage in tasklet_disable() has been replaced temporarily with the spin waiting variant until the atomic users are fixed up and will be converted to the sleep wait variant later.

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

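A sketch of the sleep-wait pairing, assuming the usual wait_var_event()/wake_up_var() helpers:

  void tasklet_unlock_wait(struct tasklet_struct *t)
  {
          wait_var_event(&t->state, !test_bit(TASKLET_STATE_RUN, &t->state));
  }

  void tasklet_unlock(struct tasklet_struct *t)
  {
          smp_mb__before_atomic();
          clear_bit(TASKLET_STATE_RUN, &t->state);
          smp_mb__after_atomic();
          wake_up_var(&t->state);   /* wake the tasklet_unlock_wait() sleeper */
  }
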
2021-03-17  rseq, ptrace: Add PTRACE_GET_RSEQ_CONFIGURATION request  (Piotr Figiel; 1 file, -0/+25)

For userspace checkpoint and restore (C/R), a way of getting process state containing the RSEQ configuration is needed. There are two ways this information is going to be used:
- to re-enable RSEQ for threads which had it enabled before C/R
- to detect if a thread was in a critical section during C/R

Since C/R preserves TLS memory and addresses, the RSEQ ABI will be restored using the address registered before C/R.

Detecting whether the thread is in a critical section during C/R is needed to enforce the behavior of RSEQ abort during C/R. Attaching with ptrace() before registers are dumped does not itself cause an RSEQ abort. Restoring the instruction pointer within the critical section is problematic because rseq_cs may get cleared before control is passed to the migrated application code, leading to RSEQ invariants not being preserved. The C/R code will use the RSEQ ABI address to find the abort handler to which the instruction pointer needs to be set.

To achieve the above goals, expose the RSEQ ABI address and the signature value with the new ptrace request PTRACE_GET_RSEQ_CONFIGURATION. This new ptrace request can also be used by debuggers so they are aware of stops within restartable sequences in progress.

Signed-off-by: Piotr Figiel <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Michal Miroslaw <[email protected]>
Reviewed-by: Mathieu Desnoyers <[email protected]>
Acked-by: Oleg Nesterov <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

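From the tracer's side, usage might look like the following sketch (struct layout and request number per the described ABI; treat both as illustrative rather than authoritative here):

  #include <stdint.h>
  #include <stdio.h>
  #include <sys/ptrace.h>
  #include <sys/types.h>

  #ifndef PTRACE_GET_RSEQ_CONFIGURATION
  #define PTRACE_GET_RSEQ_CONFIGURATION 0x420f
  struct ptrace_rseq_configuration {
          uint64_t rseq_abi_pointer;
          uint32_t rseq_abi_size;
          uint32_t signature;
          uint32_t flags;
          uint32_t pad;
  };
  #endif

  /* tracee must already be in a ptrace stop */
  static void dump_rseq_config(pid_t pid)
  {
          struct ptrace_rseq_configuration conf;

          if (ptrace(PTRACE_GET_RSEQ_CONFIGURATION, pid, sizeof(conf), &conf) < 0)
                  return;       /* no rseq registered, or old kernel */

          printf("rseq ABI @ %#llx, size %u, signature %#x\n",
                 (unsigned long long)conf.rseq_abi_pointer,
                 conf.rseq_abi_size, conf.signature);
  }
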
2021-03-17  quota: wire up quotactl_path  (Sascha Hauer; 1 file, -0/+1)

Wire up the quotactl_path syscall added in the previous patch.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sascha Hauer <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jan Kara <[email protected]>

2021-03-17  softirq: s/BUG/WARN_ONCE/ on tasklet SCHED state not set  (Dirk Behme; 1 file, -7/+18)

Replace BUG() with WARN_ONCE() on wrong tasklet state, in order to:
- increase the verbosity / aid in debugging
- avoid fatal/unrecoverable state

Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Dirk Behme <[email protected]>
Signed-off-by: Eugeniu Rosca <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

2021-03-17  locking/ww_mutex: Simplify use_ww_ctx & ww_ctx handling  (Waiman Long; 1 file, -11/+14)

The use_ww_ctx flag is passed to mutex_optimistic_spin(), but the function doesn't use it. The frequent use of the (use_ww_ctx && ww_ctx) combination is repetitive. In fact, ww_ctx should not be used at all if !use_ww_ctx.

Simplify the ww_mutex code by dropping use_ww_ctx from mutex_optimistic_spin() and clearing ww_ctx if !use_ww_ctx. In this way, we can replace (use_ww_ctx && ww_ctx) by just (ww_ctx), as in the sketch below.

Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Davidlohr Bueso <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

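The shape of the simplification (a sketch, one representative call site):

  /* at entry: normalize once */
  if (!use_ww_ctx)
          ww_ctx = NULL;

  /* before: */
  if (use_ww_ctx && ww_ctx)
          ww_mutex_set_context_fastpath(ww, ww_ctx);

  /* after: */
  if (ww_ctx)
          ww_mutex_set_context_fastpath(ww, ww_ctx);
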
2021-03-17  locking/rwsem: Fix comment typo  (Bhaskar Chowdhury; 1 file, -1/+1)

s/folowing/following/

Signed-off-by: Bhaskar Chowdhury <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Randy Dunlap <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

2021-03-17  swiotlb: lift the double initialization protection from xen-swiotlb  (Christoph Hellwig; 1 file, -0/+8)

Lift the double initialization protection from xen-swiotlb to the core code to avoid exposing too many swiotlb internals. Also upgrade the check to a warning, as it should not happen.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>

2021-03-17  swiotlb: split swiotlb_tbl_sync_single  (Christoph Hellwig; 3 files, -25/+21)

Split swiotlb_tbl_sync_single into two separate functions for the to-device and to-cpu synchronization.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>

2021-03-17  swiotlb: move orig addr and size validation into swiotlb_bounce  (Christoph Hellwig; 1 file, -36/+23)

Move the code to find and validate the original buffer address and size from the callers into swiotlb_bounce. This means a tiny bit of extra work in the swiotlb_map path, but avoids code duplication and leads to a better code structure.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>

2021-03-17  swiotlb: remove the alloc_size parameter to swiotlb_tbl_unmap_single  (Christoph Hellwig; 2 files, -23/+24)

Now that swiotlb remembers the allocation size there is no need to pass it back to swiotlb_tbl_unmap_single.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>

2021-03-17  ftrace: Fix modify_ftrace_direct.  (Alexei Starovoitov; 1 file, -5/+38)

The following sequence of commands:

  register_ftrace_direct(ip, addr1);
  modify_ftrace_direct(ip, addr1, addr2);
  unregister_ftrace_direct(ip, addr2);

will cause the kernel to warn:

  [   30.179191] WARNING: CPU: 2 PID: 1961 at kernel/trace/ftrace.c:5223 unregister_ftrace_direct+0x130/0x150
  [   30.180556] CPU: 2 PID: 1961 Comm: test_progs   W  O      5.12.0-rc2-00378-g86bc10a0a711-dirty #3246
  [   30.182453] RIP: 0010:unregister_ftrace_direct+0x130/0x150

When modify_ftrace_direct() changes the addr from old to new it should update the addr stored in ftrace_direct_funcs. Otherwise the final unregister_ftrace_direct() won't find the address and will cause the splat.

Fixes: 0567d6809182 ("ftrace: Add modify_ftrace_direct()")
Signed-off-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Steven Rostedt (VMware) <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

2021-03-16  kernel, fs: Introduce and use set_restart_fn() and arch_set_restart_data()  (Oleg Nesterov; 4 files, -5/+4)

Preparation for fixing get_nr_restart_syscall() on X86 for COMPAT.

Add a new helper which sets restart_block->fn and calls a dummy arch_set_restart_data() helper.

Fixes: 609c19a385c8 ("x86/ptrace: Stop setting TS_COMPAT in ptrace code")
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]

2021-03-16  perf/core: Fix unconditional security_locked_down() call  (Ondrej Mosnacek; 1 file, -6/+6)

Currently, the lockdown state is queried unconditionally, even though its result is used only if the PERF_SAMPLE_REGS_INTR bit is set in attr.sample_type. While that doesn't matter in case of the Lockdown LSM, it causes trouble with the SELinux's lockdown hook implementation.

SELinux implements the locked_down hook with a check whether the current task's type has the corresponding "lockdown" class permission ("integrity" or "confidentiality") allowed in the policy. This means that calling the hook when the access control decision would be ignored generates a bogus permission check and audit record.

Fix this by checking sample_type first and only calling the hook when its result would be honored.

Fixes: b0c8fdc7fdb7 ("lockdown: Lock down perf when in confidentiality mode")
Signed-off-by: Ondrej Mosnacek <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Paul Moore <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

2021-03-16  perf core: Allocate perf_event in the target node memory  (Namhyung Kim; 1 file, -1/+4)

For cpu events, it'd be better to allocate them in the corresponding node's memory, as they will be mostly accessed by the target cpu. Although perf tools set the cpu affinity before calling perf_event_open, there are places that don't (notably perf record), and we should consider other external users too.

Signed-off-by: Namhyung Kim <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

2021-03-16  perf core: Add a kmem_cache for struct perf_event  (Namhyung Kim; 1 file, -3/+6)

The kernel can allocate a lot of struct perf_event when profiling. For example, 256 cpus x 8 events x 20 cgroups = 40K instances of the struct would be allocated on a large system. The size of struct perf_event in my setup is 1152 bytes. As it's allocated by kmalloc, the actual allocation size would be rounded up to 2K. That leaves 896 bytes (~43%) of waste per instance, for a total of ~35MB with 40K instances. We can create a dedicated kmem_cache to avoid such a big unnecessary memory consumption.

With this change, I can see the following (note this machine has 112 cpus):

  # grep perf_event /proc/slabinfo
  perf_event    224    784   1152    7    2 : tunables   24   12    8 : slabdata    112    112      0

The sixth column is pages-per-slab, which is 2, and the fifth column is obj-per-slab, which is 7. Thus it can actually use 1152 x 7 = 8064 bytes of the 8K, and the wasted memory is (8192 - 8064) / 7 = ~18 bytes per instance.

Signed-off-by: Namhyung Kim <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

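A sketch of the dedicated cache (the init-hook name is illustrative):

  static struct kmem_cache *perf_event_cache;

  void __init perf_event_cache_init(void)
  {
          /* objects are laid out at sizeof(struct perf_event) granularity
           * instead of being rounded up to the next kmalloc size (2 KiB) */
          perf_event_cache = KMEM_CACHE(perf_event, SLAB_PANIC);
  }

  /* allocation and free sites then switch from kzalloc()/kfree() to: */
  event = kmem_cache_zalloc(perf_event_cache, GFP_KERNEL);
  kmem_cache_free(perf_event_cache, event);
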
2021-03-16  perf core: Allocate perf_buffer in the target node memory  (Namhyung Kim; 1 file, -3/+6)

I found the ring buffer pages are allocated in the node but the ring buffer itself is not. Let's convert it to use kzalloc_node() too.

Signed-off-by: Namhyung Kim <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

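A sketch of the conversion at the buffer allocation site:

  int node = (cpu == -1) ? NUMA_NO_NODE : cpu_to_node(cpu);

  /* before: rb = kzalloc(size, GFP_KERNEL); */
  rb = kzalloc_node(size, GFP_KERNEL, node);
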
2021-03-16  bpf: Make symbol 'bpf_task_storage_busy' static  (Wei Yongjun; 1 file, -1/+1)

The sparse tool complains as follows:

  kernel/bpf/bpf_task_storage.c:23:1: warning: symbol '__pcpu_scope_bpf_task_storage_busy' was not declared. Should it be static?

This symbol is not used outside of bpf_task_storage.c, so this commit marks it static.

Fixes: bc235cdb423a ("bpf: Prevent deadlock from recursive bpf_task_storage_[get|delete]")
Reported-by: Hulk Robot <[email protected]>
Signed-off-by: Wei Yongjun <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

2021-03-16  kernel/bpf/: Fix misspellings using codespell tool  (Liu xuzhi; 1 file, -1/+1)

A typo was found by the codespell tool on line 34 of hashtab.c:

  $ codespell ./kernel/bpf/
  ./hashtab.c:34 : differrent ==> different

Fix a typo found by codespell.

Signed-off-by: Liu xuzhi <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

2021-03-16  fanotify: configurable limits via sysfs  (Amir Goldstein; 2 files, -1/+15)

fanotify has some hardcoded limits. The only APIs to escape those limits are FAN_UNLIMITED_QUEUE and FAN_UNLIMITED_MARKS.

Allow finer grained tuning of the system limits via sysfs tunables under /proc/sys/fs/fanotify, similar to the tunables under /proc/sys/fs/inotify, with some minor differences (a sketch of one tunable's ctl_table entry follows below):

- max_queued_events - global system tunable for the group queue size limit. Like the inotify tunable with the same name, it defaults to 16384 and applies on initialization of a new group.
- max_user_marks - user ns tunable for the marks limit per user. Like the inotify tunable named max_user_watches, on a machine with sufficient RAM it defaults to 1048576 in the init userns and can be further limited per containing user ns.
- max_user_groups - user ns tunable for the number of groups per user. Like the inotify tunable named max_user_instances, it defaults to 128 in the init userns and can be further limited per containing user ns.

The slightly different tunable names used for fanotify are derived from the "group" and "mark" terminology used in the fanotify man pages and throughout the code.

Considering the fact that the default value for max_user_instances was increased in kernel v5.10 from 8192 to 1048576, leaving the legacy fanotify limit of 8192 marks per group in addition to the max_user_marks limit makes little sense, so the per-group marks limit has been removed.

Note that when a group is initialized with FAN_UNLIMITED_MARKS, its own marks are not accounted in the per-user marks account, so in effect the limit of max_user_marks applies only to the collection of groups that are not initialized with FAN_UNLIMITED_MARKS.

Link: https://lore.kernel.org/r/[email protected]
Suggested-by: Jan Kara <[email protected]>
Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Jan Kara <[email protected]>

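A sketch of what one such entry might look like as a ctl_table (field values and the data symbol are illustrative, not the literal patch):

  static struct ctl_table fanotify_table[] = {
          {
                  .procname     = "max_queued_events",
                  .data         = &fanotify_max_queued_events, /* assumed name */
                  .maxlen       = sizeof(int),
                  .mode         = 0644,
                  .proc_handler = proc_dointvec_minmax,
                  .extra1       = SYSCTL_ZERO,
          },
          { }
  };
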
2021-03-16  genirq/irq_sim: Fix typos in kernel doc (fnode -> fwnode)  (Andy Shevchenko; 1 file, -2/+2)

Fix typos in kernel doc, otherwise the validation script complains:

  .../irq_sim.c:170: warning: Function parameter or member 'fwnode' not described in 'irq_domain_create_sim'
  .../irq_sim.c:170: warning: Excess function parameter 'fnode' description in 'irq_domain_create_sim'
  .../irq_sim.c:240: warning: Function parameter or member 'fwnode' not described in 'devm_irq_domain_create_sim'
  .../irq_sim.c:240: warning: Excess function parameter 'fnode' description in 'devm_irq_domain_create_sim'

Signed-off-by: Andy Shevchenko <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

2021-03-16  genirq: Fix typos and misspellings in comments  (Krzysztof Kozlowski; 7 files, -13/+13)

No functional change.

Signed-off-by: Krzysztof Kozlowski <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

2021-03-16  tasklet: Remove tasklet_kill_immediate  (Davidlohr Bueso; 1 file, -32/+0)

Ever since RCU was converted to softirq, tasklet_kill_immediate() has had no users.

Signed-off-by: Davidlohr Bueso <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Paul E. McKenney <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

2021-03-15  rcu/nocb: Move trace_rcu_nocb_wake() calls outside nocb_lock when possible  (Frederic Weisbecker; 1 file, -3/+3)

Those tracing calls don't need to be under ->nocb_lock. This commit therefore moves them outside of that lock.

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Neeraj Upadhyay <[email protected]>
Cc: Boqun Feng <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

2021-03-15  rcu/nocb: Remove stale comment above rcu_segcblist_offload()  (Frederic Weisbecker; 1 file, -2/+1)

This commit removes a stale comment claiming that the cblist must be empty before changing the offloading state. This claim was correct back when the offloaded state was defined exclusively at boot.

Reported-by: Paul E. McKenney <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Neeraj Upadhyay <[email protected]>
Cc: Boqun Feng <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

2021-03-15  rcu/nocb: Disable bypass when CPU isn't completely offloaded  (Frederic Weisbecker; 1 file, -9/+29)

Currently, the bypass is flushed at the very last moment in the deoffloading procedure. However, this approach leads to a larger state space than would be preferred. This commit therefore disables the bypass as soon as the deoffloading procedure begins, then flushes it. This guarantees that the bypass remains empty and thus out of the way of the deoffloading procedure.

Symmetrically, this commit waits to enable the bypass until the offloading procedure has completed.

Reported-by: Paul E. McKenney <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Neeraj Upadhyay <[email protected]>
Cc: Boqun Feng <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
