aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2022-07-19rcu/nocb: Add option to opt rcuo kthreads out of RT priorityUladzislau Rezki (Sony)3-2/+23
This commit introduces a RCU_NOCB_CPU_CB_BOOST Kconfig option that prevents rcuo kthreads from running at real-time priority, even in kernels built with RCU_BOOST. This capability is important to devices needing low-latency (as in a few milliseconds) response from expedited RCU grace periods, but which are not running a classic real-time workload. On such devices, permitting the rcuo kthreads to run at real-time priority results in unacceptable latencies imposed on the application tasks, which run as SCHED_OTHER. See for example the following trace output: <snip> <...>-60 [006] d..1 2979.028717: rcu_batch_start: rcu_preempt CBs=34619 bl=270 <snip> If that rcuop kthread were permitted to run at real-time SCHED_FIFO priority, it would monopolize its CPU for hundreds of milliseconds while invoking those 34619 RCU callback functions, which would cause an unacceptably long latency spike for many application stacks on Android platforms. However, some existing real-time workloads require that callback invocation run at SCHED_FIFO priority, for example, those running on systems with heavy SCHED_OTHER background loads. (It is the real-time system's administrator's responsibility to make sure that important real-time tasks run at a higher priority than do RCU's kthreads.) Therefore, this new RCU_NOCB_CPU_CB_BOOST Kconfig option defaults to "y" on kernels built with PREEMPT_RT and defaults to "n" otherwise. The effect is to preserve current behavior for real-time systems, but for other systems to allow expedited RCU grace periods to run with real-time priority while continuing to invoke RCU callbacks as SCHED_OTHER. As you would expect, this RCU_NOCB_CPU_CB_BOOST Kconfig option has no effect except on CPUs with offloaded RCU callbacks. Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Acked-by: Joel Fernandes (Google) <[email protected]> Reviewed-by: Neeraj Upadhyay <[email protected]>
2022-07-19rcu: Add nocb_cb_kthread check to rcu_is_callbacks_kthread()Zqiang3-17/+22
Callbacks are invoked in RCU kthreads when calbacks are offloaded (rcu_nocbs boot parameter) or when RCU's softirq handler has been offloaded to rcuc kthreads (use_softirq==0). The current code allows for the rcu_nocbs case but not the use_softirq case. This commit adds support for the use_softirq case. Reported-by: kernel test robot <[email protected]> Signed-off-by: Zqiang <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Reviewed-by: Neeraj Upadhyay <[email protected]>
2022-07-19rcu/nocb: Add an option to offload all CPUs on bootJoel Fernandes2-1/+27
Systems built with CONFIG_RCU_NOCB_CPU=y but booted without either the rcu_nocbs= or rcu_nohz_full= kernel-boot parameters will not have callback offloading on any of the CPUs, nor can any of the CPUs be switched to enable callback offloading at runtime. Although this is intentional, it would be nice to have a way to offload all the CPUs without having to make random bootloaders specify either the rcu_nocbs= or the rcu_nohz_full= kernel-boot parameters. This commit therefore provides a new CONFIG_RCU_NOCB_CPU_DEFAULT_ALL Kconfig option that switches the default so as to offload callback processing on all of the CPUs. This default can still be overridden using the rcu_nocbs= and rcu_nohz_full= kernel-boot parameters. Reviewed-by: Kalesh Singh <[email protected]> Reviewed-by: Uladzislau Rezki <[email protected]> (In v4.1, fixed issues with CONFIG maze reported by kernel test robot). Reported-by: kernel test robot <[email protected]> Signed-off-by: Joel Fernandes <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Reviewed-by: Neeraj Upadhyay <[email protected]>
2022-07-19rcu/nocb: Fix NOCB kthreads spawn failure with rcu_nocb_rdp_deoffload() ↵Zqiang1-16/+64
direct call If the rcuog/o[p] kthreads spawn failed, the offloaded rdp needs to be explicitly deoffloaded, otherwise the target rdp is still considered offloaded even though nothing actually handles the callbacks. Signed-off-by: Zqiang <[email protected]> Cc: Neeraj Upadhyay <[email protected]> Cc: Boqun Feng <[email protected]> Cc: Uladzislau Rezki <[email protected]> Cc: Joel Fernandes <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Reviewed-by: Neeraj Upadhyay <[email protected]>
2022-07-19rcu/nocb: Invert rcu_state.barrier_mutex VS hotplug lock locking orderZqiang1-4/+4
In case of failure to spawn either rcuog or rcuo[p] kthreads for a given rdp, rcu_nocb_rdp_deoffload() needs to be called with the hotplug lock and the barrier_mutex held. However cpus write lock is already held while calling rcutree_prepare_cpu(). It's not possible to call rcu_nocb_rdp_deoffload() from there with just locking the barrier_mutex or this would result in a locking inversion against rcu_nocb_cpu_deoffload() which holds both locks in the reverse order. Simply solve this with inverting the locking order inside rcu_nocb_cpu_[de]offload(). This will be a pre-requisite to toggle NOCB states toward cpusets anyway. Signed-off-by: Zqiang <[email protected]> Cc: Neeraj Upadhyay <[email protected]> Cc: Boqun Feng <[email protected]> Cc: Uladzislau Rezki <[email protected]> Cc: Joel Fernandes <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Reviewed-by: Neeraj Upadhyay <[email protected]>
2022-07-19rcu/nocb: Add/del rdp to iterate from rcuog itselfFrederic Weisbecker2-68/+71
NOCB rdp's are part of a group whose list is iterated by the corresponding rdp leader. This list is RCU traversed because an rdp can be either added or deleted concurrently. Upon addition, a new iteration to the list after a synchronization point (a pair of LOCK/UNLOCK ->nocb_gp_lock) is forced to make sure: 1) we didn't miss a new element added in the middle of an iteration 2) we didn't ignore a whole subset of the list due to an element being quickly deleted and then re-added. 3) we prevent from probably other surprises... Although this layout is expected to be safe, it doesn't help anybody to sleep well. Simplify instead the nocb state toggling with moving the list modification from the nocb (de-)offloading workqueue to the rcuog kthreads instead. Whenever the rdp leader is expected to (re-)set the SEGCBLIST_KTHREAD_GP flag of a target rdp, the latter is queued so that the leader handles the flag flip along with adding or deleting the target rdp to the list to iterate. This way the list modification and iteration happen from the same kthread and those operations can't race altogether. As a bonus, the flags for each rdp don't need to be checked locklessly before each iteration, which is one less opportunity to produce nightmares. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Neeraj Upadhyay <[email protected]> Cc: Boqun Feng <[email protected]> Cc: Uladzislau Rezki <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Zqiang <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Reviewed-by: Neeraj Upadhyay <[email protected]>
2022-07-19rcu/tree: Add comment to describe GP-done condition in fqs loopNeeraj Upadhyay1-1/+9
Add a comment to explain why !rcu_preempt_blocked_readers_cgp() condition is required on root rnp node, for GP completion check in rcu_gp_fqs_loop(). Reviewed-by: Joel Fernandes (Google) <[email protected]> Signed-off-by: Neeraj Upadhyay <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2022-07-19rcu: Initialize first_gp_fqs at declaration in rcu_gp_fqs()Paul E. McKenney1-2/+1
This commit saves a line of code by initializing the rcu_gp_fqs() function's first_gp_fqs local variable in its declaration. Reported-by: Frederic Weisbecker <[email protected]> Reported-by: Neeraj Upadhyay <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2022-07-19rcu/kvfree: Remove useless monitor_todo flagJoel Fernandes (Google)1-17/+16
monitor_todo is not needed as the work struct already tracks if work is pending. Just use that to know if work is pending using schedule_delayed_work() helper. Signed-off-by: Joel Fernandes (Google) <[email protected]> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Reviewed-by: Neeraj Upadhyay <[email protected]>
2022-07-19rcu: Cleanup RCU urgency state for offline CPUZqiang1-0/+1
When a CPU is slow to provide a quiescent state for a given grace period, RCU takes steps to encourage that CPU to get with the quiescent-state program in a more timely fashion. These steps include these flags in the rcu_data structure: 1. ->rcu_urgent_qs, which causes the scheduling-clock interrupt to request an otherwise pointless context switch from the scheduler. 2. ->rcu_need_heavy_qs, which causes both cond_resched() and RCU's context-switch hook to do an immediate momentary quiscent state. 3. ->rcu_need_heavy_qs, which causes the scheduler-clock tick to be enabled even on nohz_full CPUs with only one runnable task. These flags are of course cleared once the corresponding CPU has passed through a quiescent state. Unless that quiescent state is the CPU going offline, which means that when the CPU comes back online, it will needlessly consume additional CPU time and incur additional latency, which constitutes a minor but very real performance bug. This commit therefore adds the call to rcu_disable_urgency_upon_qs() that clears these flags to the CPU-hotplug offlining code path. Signed-off-by: Zqiang <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Reviewed-by: Neeraj Upadhyay <[email protected]>
2022-07-19rcu: tiny: Record kvfree_call_rcu() call stack for KASANJohannes Berg1-0/+14
When running KASAN with Tiny RCU (e.g. under ARCH=um, where a working KASAN patch is now available), we don't get any information on the original kfree_rcu() (or similar) caller when a problem is reported, as Tiny RCU doesn't record this. Add the recording, which required pulling kvfree_call_rcu() out of line for the KASAN case since the recording function (kasan_record_aux_stack_noalloc) is neither exported, nor can we include kasan.h into rcutiny.h. without KASAN, the patch has no size impact (ARCH=um kernel): text data bss dec hex filename 6151515 4423154 33148520 43723189 29b29b5 linux 6151515 4423154 33148520 43723189 29b29b5 linux + patch with KASAN, the impact on my build was minimal: text data bss dec hex filename 13915539 7388050 33282304 54585893 340ea25 linux 13911266 7392114 33282304 54585684 340e954 linux + patch -4273 +4064 +-0 -209 Acked-by: Dmitry Vyukov <[email protected]> Signed-off-by: Johannes Berg <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2022-07-19locking/csd_lock: Change csdlock_debug from early_param to __setupChen Zhongjin1-2/+2
The csdlock_debug kernel-boot parameter is parsed by the early_param() function csdlock_debug(). If set, csdlock_debug() invokes static_branch_enable() to enable csd_lock_wait feature, which triggers a panic on arm64 for kernels built with CONFIG_SPARSEMEM=y and CONFIG_SPARSEMEM_VMEMMAP=n. With CONFIG_SPARSEMEM_VMEMMAP=n, __nr_to_section is called in static_key_enable() and returns NULL, resulting in a NULL dereference because mem_section is initialized only later in sparse_init(). This is also a problem for powerpc because early_param() functions are invoked earlier than jump_label_init(), also resulting in static_key_enable() failures. These failures cause the warning "static key 'xxx' used before call to jump_label_init()". Thus, early_param is too early for csd_lock_wait to run static_branch_enable(), so changes it to __setup to fix these. Fixes: 8d0968cc6b8f ("locking/csd_lock: Add boot parameter for controlling CSD lock debugging") Cc: [email protected] Reported-by: Chen jingwen <[email protected]> Signed-off-by: Chen Zhongjin <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2022-07-19srcu: Make expedited RCU grace periods block even less frequentlyNeeraj Upadhyay1-19/+63
The purpose of commit 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU") was to prevent a long series of never-blocking expedited SRCU grace periods from blocking kernel-live-patching (KLP) progress. Although it was successful, it also resulted in excessive boot times on certain embedded workloads running under qemu with the "-bios QEMU_EFI.fd" command line. Here "excessive" means increasing the boot time up into the three-to-four minute range. This increase in boot time was due to the more than 6000 back-to-back invocations of synchronize_rcu_expedited() within the KVM host OS, which in turn resulted from qemu's emulation of a long series of MMIO accesses. Commit 640a7d37c3f4 ("srcu: Block less aggressively for expedited grace periods") did not significantly help this particular use case. Zhangfei Gao and Shameerali Kolothum Thodi did experiments varying the value of SRCU_MAX_NODELAY_PHASE with HZ=250 and with various values of non-sleeping per phase counts on a system with preemption enabled, and observed the following boot times: +──────────────────────────+────────────────+ | SRCU_MAX_NODELAY_PHASE | Boot time (s) | +──────────────────────────+────────────────+ | 100 | 30.053 | | 150 | 25.151 | | 200 | 20.704 | | 250 | 15.748 | | 500 | 11.401 | | 1000 | 11.443 | | 10000 | 11.258 | | 1000000 | 11.154 | +──────────────────────────+────────────────+ Analysis on the experiment results show additional improvements with CPU-bound delays approaching one jiffy in duration. This improvement was also seen when number of per-phase iterations were scaled to one jiffy. This commit therefore scales per-grace-period phase number of non-sleeping polls so that non-sleeping polls extend for about one jiffy. In addition, the delay-calculation call to srcu_get_delay() in srcu_gp_end() is replaced with a simple check for an expedited grace period. This change schedules callback invocation immediately after expedited grace periods complete, which results in greatly improved boot times. Testing done by Marc and Zhangfei confirms that this change recovers most of the performance degradation in boottime; for CONFIG_HZ_250 configuration, specifically, boot times improve from 3m50s to 41s on Marc's setup; and from 2m40s to ~9.7s on Zhangfei's setup. In addition to the changes to default per phase delays, this change adds 3 new kernel parameters - srcutree.srcu_max_nodelay, srcutree.srcu_max_nodelay_phase, and srcutree.srcu_retry_check_delay. This allows users to configure the srcu grace period scanning delays in order to more quickly react to additional use cases. Fixes: 640a7d37c3f4 ("srcu: Block less aggressively for expedited grace periods") Fixes: 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU") Reported-by: Zhangfei Gao <[email protected]> Reported-by: yueluck <[email protected]> Signed-off-by: Neeraj Upadhyay <[email protected]> Tested-by: Marc Zyngier <[email protected]> Tested-by: Zhangfei Gao <[email protected]> Link: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Paul E. McKenney <[email protected]>
2022-07-19rcu: Forbid RCU_STRICT_GRACE_PERIOD in TINY_RCU kernelsPaul E. McKenney1-1/+1
The RCU_STRICT_GRACE_PERIOD Kconfig option does nothing in kernels built with CONFIG_TINY_RCU=y, so this commit adjusts the dependencies to disallow this combination. Signed-off-by: Paul E. McKenney <[email protected]> Reviewed-by: Neeraj Upadhyay <[email protected]>
2022-07-19srcu: Block less aggressively for expedited grace periodsPaul E. McKenney1-7/+13
Commit 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU") fixed a problem where a long-running expedited SRCU grace period could block kernel live patching. It did so by giving up on expediting once a given SRCU expedited grace period grew too old. Unfortunately, this added excessive delays to boots of virtual embedded systems specifying "-bios QEMU_EFI.fd" to qemu. This commit therefore makes the transition away from expediting less aggressive, increasing the per-grace-period phase number of non-sleeping polls of readers from one to three and increasing the required grace-period age from one jiffy (actually from zero to one jiffies) to two jiffies (actually from one to two jiffies). Fixes: 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU") Signed-off-by: Paul E. McKenney <[email protected]> Reported-by: Zhangfei Gao <[email protected]> Reported-by: chenxiang (M)" <[email protected]> Cc: Shameerali Kolothum Thodi <[email protected]> Cc: Paolo Bonzini <[email protected]> Reviewed-by: Neeraj Upadhyay <[email protected]> Link: https://lore.kernel.org/all/[email protected]/
2022-07-19rcu: Immediately boost preempted readers for strict grace periodsZqiang1-1/+2
The intent of the CONFIG_RCU_STRICT_GRACE_PERIOD Konfig option is to cause normal grace periods to complete quickly in order to better catch errors resulting from improperly leaking pointers from RCU read-side critical sections. However, kernels built with this option enabled still wait for some hundreds of milliseconds before boosting RCU readers that have been preempted within their current critical section. The value of this delay is set by the CONFIG_RCU_BOOST_DELAY Kconfig option, which defaults to 500 milliseconds. This commit therefore causes kernels build with strict grace periods to ignore CONFIG_RCU_BOOST_DELAY. This causes rcu_initiate_boost() to start boosting immediately after all CPUs on a given leaf rcu_node structure have passed through their quiescent states. Signed-off-by: Zqiang <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Reviewed-by: Neeraj Upadhyay <[email protected]>
2022-07-19rcu: Add rnp->cbovldmask check in rcutree_migrate_callbacks()Zqiang1-0/+1
Currently, the rcu_node structure's ->cbovlmask field is set in call_rcu() when a given CPU is suffering from callback overload. But if that CPU goes offline, the outgoing CPU's callbacks is migrated to the running CPU, which is likely to overload the running CPU. However, that CPU's bit in its leaf rcu_node structure's ->cbovlmask field remains zero. Initially, this is OK because the outgoing CPU's bit remains set. However, that bit will be cleared at the next end of a grace period, at which time it is quite possible that the running CPU will still be overloaded. If the running CPU invokes call_rcu(), then overload will be checked for and the bit will be set. Except that there is no guarantee that the running CPU will invoke call_rcu(), in which case the next grace period will fail to take the running CPU's overload condition into account. Plus, because the bit is not set, the end of the grace period won't check for overload on this CPU. This commit therefore adds a call to check_cb_ovld_locked() in rcutree_migrate_callbacks() to set the running CPU's ->cbovlmask bit appropriately. Signed-off-by: Zqiang <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Reviewed-by: Neeraj Upadhyay <[email protected]>
2022-07-19rcu: Avoid tracing a few functions executed in stop machinePatrick Wang1-5/+5
Stop-machine recently started calling additional functions while waiting: ---------------------------------------------------------------- Former stop machine wait loop: do { cpu_relax(); => macro ... } while (curstate != STOPMACHINE_EXIT); ----------------------------------------------------------------- Current stop machine wait loop: do { stop_machine_yield(cpumask); => function (notraced) ... touch_nmi_watchdog(); => function (notraced, inside calls also notraced) ... rcu_momentary_dyntick_idle(); => function (notraced, inside calls traced) } while (curstate != MULTI_STOP_EXIT); ------------------------------------------------------------------ These functions (and the functions that they call) must be marked notrace to prevent them from being updated while they are executing. The consequences of failing to mark these functions can be severe: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: rcu: 1-...!: (0 ticks this GP) idle=14f/1/0x4000000000000000 softirq=3397/3397 fqs=0 rcu: 3-...!: (0 ticks this GP) idle=ee9/1/0x4000000000000000 softirq=5168/5168 fqs=0 (detected by 0, t=8137 jiffies, g=5889, q=2 ncpus=4) Task dump for CPU 1: task:migration/1 state:R running task stack: 0 pid: 19 ppid: 2 flags:0x00000000 Stopper: multi_cpu_stop+0x0/0x18c <- stop_machine_cpuslocked+0x128/0x174 Call Trace: Task dump for CPU 3: task:migration/3 state:R running task stack: 0 pid: 29 ppid: 2 flags:0x00000000 Stopper: multi_cpu_stop+0x0/0x18c <- stop_machine_cpuslocked+0x128/0x174 Call Trace: rcu: rcu_preempt kthread timer wakeup didn't happen for 8136 jiffies! g5889 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 rcu: Possible timer handling issue on cpu=2 timer-softirq=594 rcu: rcu_preempt kthread starved for 8137 jiffies! g5889 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2 rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior. rcu: RCU grace-period kthread stack dump: task:rcu_preempt state:I stack: 0 pid: 14 ppid: 2 flags:0x00000000 Call Trace: schedule+0x56/0xc2 schedule_timeout+0x82/0x184 rcu_gp_fqs_loop+0x19a/0x318 rcu_gp_kthread+0x11a/0x140 kthread+0xee/0x118 ret_from_exception+0x0/0x14 rcu: Stack dump where RCU GP kthread last ran: Task dump for CPU 2: task:migration/2 state:R running task stack: 0 pid: 24 ppid: 2 flags:0x00000000 Stopper: multi_cpu_stop+0x0/0x18c <- stop_machine_cpuslocked+0x128/0x174 Call Trace: This commit therefore marks these functions notrace: rcu_preempt_deferred_qs() rcu_preempt_need_deferred_qs() rcu_preempt_deferred_qs_irqrestore() [ paulmck: Apply feedback from Neeraj Upadhyay. ] Signed-off-by: Patrick Wang <[email protected]> Acked-by: Steven Rostedt (Google) <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Reviewed-by: Neeraj Upadhyay <[email protected]>
2022-07-19rcu: Decrease FQS scan wait time in case of callback overloadingPaul E. McKenney1-1/+6
The force-quiesce-state loop function rcu_gp_fqs_loop() checks for callback overloading and does an immediate initial scan for idle CPUs if so. However, subsequent rescans will be carried out at as leisurely a rate as they always are, as specified by the rcutree.jiffies_till_next_fqs module parameter. It might be tempting to just continue immediately rescanning, but this turns the RCU grace-period kthread into a CPU hog. It might also be tempting to reduce the time between rescans to a single jiffy, but this can be problematic on larger systems. This commit therefore divides the normal time between rescans by three, rounding up. Thus a small system running at HZ=1000 that is suffering from callback overload will wait only one jiffy instead of the normal three between rescans. [ paulmck: Apply Neeraj Upadhyay feedback. ] Signed-off-by: Paul E. McKenney <[email protected]> Reviewed-by: Neeraj Upadhyay <[email protected]>
2022-07-19bpf: remove obsolete KMALLOC_MAX_SIZE restriction on array map value sizeAndrii Nakryiko1-4/+2
Syscall-side map_lookup_elem() and map_update_elem() used to use kmalloc() to allocate temporary buffers of value_size, so KMALLOC_MAX_SIZE limit on value_size made sense to prevent creation of array map that won't be accessible through syscall interface. But this limitation since has been lifted by relying on kvmalloc() in syscall handling code. So remove KMALLOC_MAX_SIZE, which among other things means that it's possible to have BPF global variable sections (.bss, .data, .rodata) bigger than 8MB now. Keep the sanity check to prevent trivial overflows like round_up(map->value_size, 8) and restrict value size to <= INT_MAX (2GB). Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2022-07-19bpf: make uniform use of array->elem_size everywhere in arraymap.cAndrii Nakryiko1-6/+8
BPF_MAP_TYPE_ARRAY is rounding value_size to closest multiple of 8 and stores that as array->elem_size for various memory allocations and accesses. But the code tends to re-calculate round_up(map->value_size, 8) in multiple places instead of using array->elem_size. Cleaning this up and making sure we always use array->size to avoid duplication of this (admittedly simple) logic for consistency. Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2022-07-19bpf: fix potential 32-bit overflow when accessing ARRAY map elementAndrii Nakryiko1-8/+12
If BPF array map is bigger than 4GB, element pointer calculation can overflow because both index and elem_size are u32. Fix this everywhere by forcing 64-bit multiplication. Extract this formula into separate small helper and use it consistently in various places. Speculative-preventing formula utilizing index_mask trick is left as is, but explicit u64 casts are added in both places. Fixes: c85d69135a91 ("bpf: move memory size checks to bpf_map_charge_init()") Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2022-07-19bpf: fix lsm_cgroup build errors on esoteric configsStanislav Fomichev2-3/+7
This particular ones is about having the following: CONFIG_BPF_LSM=y # CONFIG_CGROUP_BPF is not set Also, add __maybe_unused to the args for the !CONFIG_NET cases. Reported-by: kernel test robot <[email protected]> Signed-off-by: Stanislav Fomichev <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2022-07-19irqdomain: Use hwirq_max instead of revmap_size for NOMAP domainsXu Qiang1-6/+6
NOMAP irq domains use the revmap_size field to indicate the maximum hwirq number the domain accepts. This is a bit confusing as revmap_size is usually used to indicate the size of the revmap array, which a NOMAP domain doesn't have. Instead, use the hwirq_max field which has the correct semantics, and keep revmap_size to 0 for a NOMAP domain. Signed-off-by: Xu Qiang <[email protected]> [maz: commit message] Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-07-19irqdomain: Report irq number for NOMAP domainsXu Qiang1-0/+2
When using a NOMAP domain, __irq_resolve_mapping() doesn't store the Linux IRQ number at the address optionally provided by the caller. While this isn't a huge deal (the returned value is guaranteed to the hwirq that was passed as a parameter), let's honour the letter of the API by writing the expected value. Fixes: d22558dd0a6c (“irqdomain: Introduce irq_resolve_mapping()”) Signed-off-by: Xu Qiang <[email protected]> [maz: commit message] Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-07-19dma-mapping: add dma_opt_mapping_size()John Garry1-0/+12
Streaming DMA mapping involving an IOMMU may be much slower for larger total mapping size. This is because every IOMMU DMA mapping requires an IOVA to be allocated and freed. IOVA sizes above a certain limit are not cached, which can have a big impact on DMA mapping performance. Provide an API for device drivers to know this "optimal" limit, such that they may try to produce mapping which don't exceed it. Signed-off-by: John Garry <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Acked-by: Martin K. Petersen <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-07-18signal: break out of wait loops on kthread_stop()Jason A. Donenfeld1-0/+1
I was recently surprised to learn that msleep_interruptible(), wait_for_completion_interruptible_timeout(), and related functions simply hung when I called kthread_stop() on kthreads using them. The solution to fixing the case with msleep_interruptible() was more simply to move to schedule_timeout_interruptible(). Why? The reason is that msleep_interruptible(), and many functions just like it, has a loop like this: while (timeout && !signal_pending(current)) timeout = schedule_timeout_interruptible(timeout); The call to kthread_stop() woke up the thread, so schedule_timeout_ interruptible() returned early, but because signal_pending() returned true, it went back into another timeout, which was never woken up. This wait loop pattern is common to various pieces of code, and I suspect that the subtle misuse in a kthread that caused a deadlock in the code I looked at last week is also found elsewhere. So this commit causes signal_pending() to return true when kthread_stop() is called, by setting TIF_NOTIFY_SIGNAL. The same also probably applies to the similar kthread_park() functionality, but that can be addressed later, as its semantics are slightly different. Cc: Eric W. Biederman <[email protected]> Signed-off-by: Jason A. Donenfeld <[email protected]> v1: https://lkml.kernel.org/r/[email protected] v2: https://lkml.kernel.org/r/[email protected] v3: https://lkml.kernel.org/r/[email protected] v4: https://lkml.kernel.org/r/[email protected] v5: https://lkml.kernel.org/r/[email protected] Signed-off-by: Eric W. Biederman <[email protected]>
2022-07-18timekeeping: contribute wall clock to rng on time changeJason A. Donenfeld1-1/+6
The rng's random_init() function contributes the real time to the rng at boot time, so that events can at least start in relation to something particular in the real world. But this clock might not yet be set that point in boot, so nothing is contributed. In addition, the relation between minor clock changes from, say, NTP, and the cycle counter is potentially useful entropic data. This commit addresses this by mixing in a time stamp on calls to settimeofday and adjtimex. No entropy is credited in doing so, so it doesn't make initialization faster, but it is still useful input to have. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Cc: [email protected] Reviewed-by: Thomas Gleixner <[email protected]> Reviewed-by: Eric Biggers <[email protected]> Signed-off-by: Jason A. Donenfeld <[email protected]>
2022-07-18swiotlb: move struct io_tlb_slot to swiotlb.cChristoph Hellwig1-0/+6
No need to expose this structure definition in the header. Signed-off-by: Christoph Hellwig <[email protected]>
2022-07-18swiotlb: ensure a segment doesn't cross the area boundaryChao Gao1-1/+10
Free slots tracking assumes that slots in a segment can be allocated to fulfill a request. This implies that slots in a segment should belong to the same area. Although the possibility of a violation is low, it is better to explicitly enforce segments won't span multiple areas by adjusting the number of slabs when configuring areas. Signed-off-by: Chao Gao <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-07-18swiotlb: consolidate rounding up default_nslabsChao Gao1-17/+16
default_nslabs are rounded up in two cases with exactly same comments. Add a simple wrapper to reduce duplicate code/comments. It is preparatory to adding more logics into the round-up. No functional change intended. Signed-off-by: Chao Gao <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-07-18swiotlb: remove unused fields in io_tlb_memChao Gao1-2/+0
Commit 20347fca71a3 ("swiotlb: split up the global swiotlb lock") splits io_tlb_mem into multiple areas. Each area has its own lock and index. The global ones are not used so remove them. Signed-off-by: Chao Gao <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-07-18swiotlb: fix use after free on error handling pathDan Carpenter1-1/+1
Don't dereference "mem" after it has been freed. Flip the two kfree()s around to address this bug. Fixes: 26ffb91fa5e0 ("swiotlb: split up the global swiotlb lock") Signed-off-by: Dan Carpenter <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-07-17kdump: round up the total memory size to 128M for crashkernel reservationTao Liu1-2/+12
The total memory size we get in kernel is usually slightly less than the actual memory size because BIOS/firmware will reserve some memory region. So it won't export all memory as usable. E.g, on my x86_64 kvm guest with 1G memory, the total_mem value shows: UEFI boot with ovmf: 0x3faef000 Legacy boot kvm guest: 0x3ff7ec00 When specifying crashkernel=1G-2G:128M, if we have a 1G memory machine, we get total size 1023M from firmware. Then it will not fall into 1G-2G, thus no memory reserved. User will never know this, it is hard to let user know the exact total value in kernel. One way is to use dmi/smbios to get physical memory size, but it's not reliable as well. According to Prarit hardware vendors sometimes screw this up. Thus round up total size to 128M to work around this problem. This patch is a resend of [1] and rebased onto v5.19-rc2, and the original credit goes to Dave Young. [1]: http://lists.infradead.org/pipermail/kexec/2018-April/020568.html Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Tao Liu <[email protected]> Acked-by: Baoquan He <[email protected]> Cc: Dave Young <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17vmcoreinfo: include kallsyms symbolsStephen Brennan1-0/+14
The internal kallsyms tables contain information which could be quite useful to a debugging tool in the absence of other debuginfo. If kallsyms is enabled, then a debugging tool could parse it and use it as a fallback symbol table. Combined with BTF data, live & post-mortem debuggers can support basic operations without needing a large DWARF debuginfo file available. As many as five symbols are necessary to properly parse kallsyms names and addresses. Add these to the vmcoreinfo note. CONFIG_KALLSYMS_ABSOLUTE_PERCPU does impact the computation of symbol addresses. However, a debugger can infer this configuration value by comparing the address of _stext in the vmcoreinfo with the address computed via kallsyms. So there's no need to include information about this config value in the vmcoreinfo note. To verify that we're still well below the maximum of 4096 bytes, I created a script[1] to compute a rough upper bound on the possible size of vmcoreinfo. On v5.18-rc7, the script reports 3106 bytes, and with this patch, the maximum become 3370 bytes. [1]: https://github.com/brenns10/kernel_stuff/blob/master/vmcoreinfosize/ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stephen Brennan <[email protected]> Acked-by: Baoquan He <[email protected]> Cc: Bixuan Cui <[email protected]> Cc: Dave Young <[email protected]> Cc: David Vernet <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kees Cook <[email protected]> Cc: Nick Desaulniers <[email protected]> Cc: Sami Tolvanen <[email protected]> Cc: Stephen Boyd <[email protected]> Cc: Vivek Goyal <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17kallsyms: move declarations to internal headerStephen Brennan2-22/+31
Patch series "Expose kallsyms data in vmcoreinfo note". The kernel can be configured to contain a lot of introspection or debugging information built-in, such as ORC for unwinding stack traces, BTF for type information, and of course kallsyms. Debuggers could use this information to navigate a core dump or live system, but they need to be able to find it. This patch series adds the necessary symbols into vmcoreinfo, which would allow a debugger to find and interpret the kallsyms table. Using the kallsyms data, the debugger can then lookup any symbol, allowing it to find ORC, BTF, or any other useful data. This would allow a live kernel, or core dump, to be debugged without any DWARF debuginfo. This is useful for many cases: the debuginfo may not have been generated, or you may not want to deploy the large files everywhere you need them. I've demonstrated a proof of concept for this at LSF/MM+BPF during a lighting talk. Using a work-in-progress branch of the drgn debugger, and an extended set of BTF generated by a patched version of dwarves, I've been able to open a core dump without any DWARF info and do basic tasks such as enumerating slab caches, block devices, tasks, and doing backtraces. I hope this series can be a first step toward a new possibility of "DWARFless debugging". Related discussion around the BTF side of this: https://lore.kernel.org/bpf/[email protected]/T/#u Some work-in-progress branches using this feature: https://github.com/brenns10/dwarves/tree/remove_percpu_restriction_1 https://github.com/brenns10/drgn/tree/kallsyms_plus_btf This patch (of 2): To include kallsyms data in the vmcoreinfo note, we must make the symbol declarations visible outside of kallsyms.c. Move these to a new internal header file. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stephen Brennan <[email protected]> Acked-by: Baoquan He <[email protected]> Cc: Nick Desaulniers <[email protected]> Cc: Dave Young <[email protected]> Cc: Kees Cook <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Stephen Boyd <[email protected]> Cc: Bixuan Cui <[email protected]> Cc: David Vernet <[email protected]> Cc: Vivek Goyal <[email protected]> Cc: Sami Tolvanen <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17Merge tag 'perf_urgent_for_v5.19_rc7' of ↵Linus Torvalds1-14/+31
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fix from Borislav Petkov: - A single data race fix on the perf event cleanup path to avoid endless loops due to insufficient locking * tag 'perf_urgent_for_v5.19_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix data race between perf_event_set_output() and perf_mmap_close()
2022-07-16Merge tag 'printk-for-5.19-rc7' of ↵Linus Torvalds1-2/+11
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk fix from Petr Mladek: - Make pr_flush() fast when consoles are suspended. * tag 'printk-for-5.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: printk: do not wait for consoles when suspended
2022-07-16fs: remove no_llseekJason A. Donenfeld1-2/+1
Now that all callers of ->llseek are going through vfs_llseek(), we don't gain anything by keeping no_llseek around. Nothing actually calls it and setting ->llseek to no_lseek is completely equivalent to leaving it NULL. Longer term (== by the end of merge window) we want to remove all such intializations. To simplify the merge window this commit does *not* touch initializers - it only defines no_llseek as NULL (and simplifies the tests on file opening). At -rc1 we'll need do a mechanical removal of no_llseek - git grep -l -w no_llseek | grep -v porting.rst | while read i; do sed -i '/\<no_llseek\>/d' $i done would do it. Signed-off-by: Jason A. Donenfeld <[email protected]> Signed-off-by: Al Viro <[email protected]>
2022-07-15blktrace: Fix the blk_fill_rwbs() kernel-doc headerBart Van Assche1-3/+3
Reflect recent changes in the blk_fill_rwbs() kernel-doc header. Reported-by: Stephen Rothwell <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Li Zefan <[email protected]> Cc: Chaitanya Kulkarni <[email protected]> Cc: Stephen Rothwell <[email protected]> Fixes: 919dbca8670d ("blktrace: Use the new blk_opf_t type") Signed-off-by: Bart Van Assche <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-07-15bpf: iterators: Build and use lightweight bootstrap version of bpftoolPu Lehui1-7/+3
kernel/bpf/preload/iterators use bpftool for vmlinux.h, skeleton, and static linking only. So we can use lightweight bootstrap version of bpftool to handle these, and it will be faster. Suggested-by: Andrii Nakryiko <[email protected]> Signed-off-by: Pu Lehui <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2022-07-15security: Add LSM hook to setgroups() syscallMicah Morton1-0/+13
Give the LSM framework the ability to filter setgroups() syscalls. There are already analagous hooks for the set*uid() and set*gid() syscalls. The SafeSetID LSM will use this new hook to ensure setgroups() calls are allowed by the installed security policy. Tested by putting print statement in security_task_fix_setgroups() hook and confirming that it gets hit when userspace does a setgroups() syscall. Acked-by: Casey Schaufler <[email protected]> Reviewed-by: Serge Hallyn <[email protected]> Signed-off-by: Micah Morton <[email protected]>
2022-07-15PM: EM: convert power field to micro-Watts precision and align driversLukasz Luba1-8/+16
The milli-Watts precision causes rounding errors while calculating efficiency cost for each OPP. This is especially visible in the 'simple' Energy Model (EM), where the power for each OPP is provided from OPP framework. This can cause some OPPs to be marked inefficient, while using micro-Watts precision that might not happen. Update all EM users which access 'power' field and assume the value is in milli-Watts. Solve also an issue with potential overflow in calculation of energy estimation on 32bit machine. It's needed now since the power value (thus the 'cost' as well) are higher. Example calculation which shows the rounding error and impact: power = 'dyn-power-coeff' * volt_mV * volt_mV * freq_MHz power_a_uW = (100 * 600mW * 600mW * 500MHz) / 10^6 = 18000 power_a_mW = (100 * 600mW * 600mW * 500MHz) / 10^9 = 18 power_b_uW = (100 * 605mW * 605mW * 600MHz) / 10^6 = 21961 power_b_mW = (100 * 605mW * 605mW * 600MHz) / 10^9 = 21 max_freq = 2000MHz cost_a_mW = 18 * 2000MHz/500MHz = 72 cost_a_uW = 18000 * 2000MHz/500MHz = 72000 cost_b_mW = 21 * 2000MHz/600MHz = 70 // <- artificially better cost_b_uW = 21961 * 2000MHz/600MHz = 73203 The 'cost_b_mW' (which is based on old milli-Watts) is misleadingly better that the 'cost_b_uW' (this patch uses micro-Watts) and such would have impact on the 'inefficient OPPs' information in the Cpufreq framework. This patch set removes the rounding issue. Signed-off-by: Lukasz Luba <[email protected]> Acked-by: Daniel Lezcano <[email protected]> Acked-by: Viresh Kumar <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2022-07-15bpf: Fix check against plain integer v 'NULL'Ben Dooks1-1/+2
When checking with sparse, btf_show_type_value() is causing a warning about checking integer vs NULL when the macro is passed a pointer, due to the 'value != 0' check. Stop sparse complaining about any type-casting by adding a cast to the typeof(value). This fixes the following sparse warnings: kernel/bpf/btf.c:2579:17: warning: Using plain integer as NULL pointer kernel/bpf/btf.c:2581:17: warning: Using plain integer as NULL pointer kernel/bpf/btf.c:3407:17: warning: Using plain integer as NULL pointer kernel/bpf/btf.c:3758:9: warning: Using plain integer as NULL pointer Signed-off-by: Ben Dooks <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2022-07-15Merge tag 'sysctl-fixes-5.19-rc7' of ↵Linus Torvalds1-9/+11
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pyll sysctl fix from Luis Chamberlain: "Only one fix for sysctl" * tag 'sysctl-fixes-5.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: mm: sysctl: fix missing numa_stat when !CONFIG_HUGETLB_PAGE
2022-07-15kexec, KEYS: make the code in bzImage64_verify_sig genericCoiby Xu1-0/+17
commit 278311e417be ("kexec, KEYS: Make use of platform keyring for signature verify") adds platform keyring support on x86 kexec but not arm64. The code in bzImage64_verify_sig uses the keys on the .builtin_trusted_keys, .machine, if configured and enabled, .secondary_trusted_keys, also if configured, and .platform keyrings to verify the signed kernel image as PE file. Cc: [email protected] Cc: [email protected] Cc: [email protected] Reviewed-by: Michal Suchanek <[email protected]> Signed-off-by: Coiby Xu <[email protected]> Signed-off-by: Mimi Zohar <[email protected]>
2022-07-15kexec: clean up arch_kexec_kernel_verify_sigCoiby Xu1-20/+13
Before commit 105e10e2cf1c ("kexec_file: drop weak attribute from functions"), there was already no arch-specific implementation of arch_kexec_kernel_verify_sig. With weak attribute dropped by that commit, arch_kexec_kernel_verify_sig is completely useless. So clean it up. Note later patches are dependent on this patch so it should be backported to the stable tree as well. Cc: [email protected] Suggested-by: Eric W. Biederman <[email protected]> Reviewed-by: Michal Suchanek <[email protected]> Acked-by: Baoquan He <[email protected]> Signed-off-by: Coiby Xu <[email protected]> [[email protected]: reworded patch description "Note"] Link: https://lore.kernel.org/linux-integrity/[email protected]/ Signed-off-by: Mimi Zohar <[email protected]>
2022-07-15kexec: drop weak attribute from functionsNaveen N. Rao1-27/+0
Drop __weak attribute from functions in kexec_core.c: - machine_kexec_post_load() - arch_kexec_protect_crashkres() - arch_kexec_unprotect_crashkres() - crash_free_reserved_phys_range() Link: https://lkml.kernel.org/r/c0f6219e03cb399d166d518ab505095218a902dd.1656659357.git.naveen.n.rao@linux.vnet.ibm.com Signed-off-by: Naveen N. Rao <[email protected]> Suggested-by: Eric Biederman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Mimi Zohar <[email protected]>
2022-07-15kexec_file: drop weak attribute from functionsNaveen N. Rao1-33/+2
As requested (http://lkml.kernel.org/r/[email protected]), this series converts weak functions in kexec to use the #ifdef approach. Quoting the 3e35142ef99fe ("kexec_file: drop weak attribute from arch_kexec_apply_relocations[_add]") changelog: : Since commit d1bcae833b32f1 ("ELF: Don't generate unused section symbols") : [1], binutils (v2.36+) started dropping section symbols that it thought : were unused. This isn't an issue in general, but with kexec_file.c, gcc : is placing kexec_arch_apply_relocations[_add] into a separate : .text.unlikely section and the section symbol ".text.unlikely" is being : dropped. Due to this, recordmcount is unable to find a non-weak symbol in : .text.unlikely to generate a relocation record against. This patch (of 2); Drop __weak attribute from functions in kexec_file.c: - arch_kexec_kernel_image_probe() - arch_kimage_file_post_load_cleanup() - arch_kexec_kernel_image_load() - arch_kexec_locate_mem_hole() - arch_kexec_kernel_verify_sig() arch_kexec_kernel_image_load() calls into kexec_image_load_default(), so drop the static attribute for the latter. arch_kexec_kernel_verify_sig() is not overridden by any architecture, so drop the __weak attribute. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/2cd7ca1fe4d6bb6ca38e3283c717878388ed6788.1656659357.git.naveen.n.rao@linux.vnet.ibm.com Signed-off-by: Naveen N. Rao <[email protected]> Suggested-by: Eric Biederman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Mimi Zohar <[email protected]>
2022-07-15Merge branch 'rework/kthreads' into for-linusPetr Mladek1-2/+11