aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2024-02-27Merge branch 'x86/urgent' into x86/apic, to resolve conflictsIngo Molnar5-11/+11
Conflicts: arch/x86/kernel/cpu/common.c arch/x86/kernel/cpu/intel.c Signed-off-by: Ingo Molnar <[email protected]>
2024-02-27smp: Avoid 'setup_max_cpus' namespace collision/shadowingIngo Molnar1-3/+3
bringup_nonboot_cpus() gets passed the 'setup_max_cpus' variable in init/main.c - which is also the name of the parameter, shadowing the name. To reduce confusion and to allow the 'setup_max_cpus' value to be #defined in the <linux/smp.h> header, use the 'max_cpus' name for the function parameter name. Signed-off-by: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected]
2024-02-26Merge branches 'rcu-doc.2024.02.14a', 'rcu-nocb.2024.02.14a', ↵Boqun Feng14-305/+384
'rcu-exp.2024.02.14a', 'rcu-tasks.2024.02.26a' and 'rcu-misc.2024.02.14a' into rcu.2024.02.26a
2024-02-26timers: Assert no next dyntick timer look-up while CPU is offlineFrederic Weisbecker1-3/+3
The next timer (re-)evaluation, with the purpose of entering/updating the dyntick mode, can happen from 3 sites and none of them are relevant while the CPU is offline: 1) The idle loop: a) From the quick check helping the cpuidle governor to heuristically predict the best C-state. b) While stopping the tick. But if the CPU is offline, the tick has been cancelled and there is consequently no need to further stop the tick. 2) Remote expiry: when a CPU remotely expires global timers on behalf of another CPU, the latter target's next timer is re-evaluated afterwards. However remote expîry doesn't happen on offline CPUs. 3) IRQ exit: on nohz_full mode, the tick is (re-)evaluated on IRQ exit. But full dynticks is disabled on offline CPUs. Therefore it is safe to assume that no next dyntick timer lookup can be performed on offline CPUs. Assert this expectation to report any surprise. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-26tick: Assume timekeeping is correctly handed over upon last offline idle callFrederic Weisbecker4-13/+6
The timekeeping duty is handed over from the outgoing CPU on stop machine, then the oneshot tick is stopped right after. Therefore it's guaranteed that the current CPU isn't the timekeeper upon its last call to idle. Besides, calling tick_nohz_idle_stop_tick() while the dying CPU goes into idle suggests that the tick is going to be stopped while it is actually stopped already from the appropriate CPU hotplug state. Remove the confusing call and the obsolete case handling and convert it to a sanity check that verifies the above assumption. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-26tick: Shut down low-res tick from dying CPUFrederic Weisbecker3-10/+29
The timekeeping duty is handed over from the outgoing CPU within stop machine. This works well if CONFIG_NO_HZ_COMMON=n or the tick is in high-res mode. However in low-res dynticks mode, the tick isn't cancelled until the clockevent is shut down, which can happen later. The tick may therefore fire again once IRQs are re-enabled on stop machine and until IRQs are disabled for good upon the last call to idle. That's so many opportunities for a timekeeper to go idle and the outgoing CPU to take over that duty. This is why tick_nohz_idle_stop_tick() is called one last time on idle if the CPU is seen offline: so that the timekeeping duty is handed over again in case the CPU has re-taken the duty. This means there are two timekeeping handovers on CPU down hotplug with different undocumented constraints and purposes: 1) A handover on stop machine for !dynticks || highres. All online CPUs are guaranteed to be non-idle and the timekeeping duty can be safely handed-over. The hrtimer tick is cancelled so it is guaranteed that in dynticks mode the outgoing CPU won't take again the duty. 2) A handover on last idle call for dynticks && lowres. Setting the duty to TICK_DO_TIMER_NONE makes sure that a CPU will take over the timekeeping. Prepare for consolidating the handover to a single place (the first one) with shutting down the low-res tick as well from tick_cancel_sched_timer() as well. This will simplify the handover and unify the tick cancellation between high-res and low-res. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-26tick: Split nohz and highres features from nohz_modeFrederic Weisbecker4-26/+26
The nohz mode field tells about low resolution nohz mode or high resolution nohz mode but it doesn't tell about high resolution non-nohz mode. In order to retrieve the latter state, tick_cancel_sched_timer() must fiddle with struct hrtimer's internals to guess if the tick has been initialized in high resolution. Move instead the nohz mode field information into the tick flags and provide two new bits: one to know if the tick is in nohz mode and another one to know if the tick is in high resolution. The combination of those two flags provides all the needed informations to determine which of the three tick modes is running. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-26tick: Move individual bit features to debuggable mask accessesFrederic Weisbecker3-43/+73
The individual bitfields of struct tick_sched must be modified from IRQs disabled places, otherwise local modifications can race due to them sharing the same memory storage. The recent move of the "got_idle_tick" bitfield to its own storage shows that the use of these bitfields, as pretty as they look, can be as much error prone. In order to avoid future issues of the like and make sure that those bitfields are safely accessed, move those flags to an explicit mask along with a mutator function performing the basic IRQs disabled sanity check. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-26tick: Move got_idle_tick away from common flagsFrederic Weisbecker1-1/+1
tick_nohz_idle_got_tick() is called by cpuidle_reflect() within the idle loop with interrupts enabled. This function modifies the struct tick_sched's bitfield "got_idle_tick". However this bitfield is stored within the same mask as other bitfields that can be modified from interrupts. Fortunately so far it looks like the only race that can happen is while writing ->got_idle_tick to 0, an interrupt fires and writes the ->idle_active field to 0. It's then possible that the interrupted write to ->got_idle_tick writes back the old value of ->idle_active back to 1. However if that happens, the worst possible outcome is that the time spent between that interrupt and the upcoming call to tick_nohz_idle_exit() is accounted as idle, which is negligible quantity. Still all the bitfield writes within this struct tick_sched's shadow mask should be IRQ-safe. Therefore move this bitfield out to its own storage to avoid further suprises. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-26tick: Assume the tick can't be stopped in NOHZ_MODE_INACTIVE modeFrederic Weisbecker1-1/+1
The full-nohz update function checks if the nohz mode is active before proceeding. It considers one exception though: if the tick is already stopped even though the nohz mode is inactive, it still moves on in order to update/restart the tick if needed. However in order for the tick to be stopped, the nohz_mode has to be either NOHZ_MODE_LOWRES or NOHZ_MODE_HIGHRES. Therefore it doesn't make sense to test if the tick is stopped before verifying NOHZ_MODE_INACTIVE mode. Remove the needless related condition. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-26tick: Move broadcast cancellation up to CPUHP_AP_TICK_DYINGFrederic Weisbecker3-2/+5
The broadcast shutdown code is executed through a random explicit call within stop machine from the outgoing CPU. However the tick broadcast is a midware between the tick callback and the clocksource, therefore it makes more sense to shut it down after the tick callback and before the clocksource drivers. Move it instead to the common tick shutdown CPU hotplug state where related operations can be ordered from highest to lowest level. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-26tick: Move tick cancellation up to CPUHP_AP_TICK_DYINGFrederic Weisbecker2-2/+2
The tick hrtimer is cancelled right before hrtimers are migrated. This is done from the hrtimer subsystem even though it shouldn't know about its actual users. Move instead the tick hrtimer cancellation to the relevant CPU hotplug state that aims at centralizing high level tick shutdown operations so that the related flow is easy to follow. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-26tick: Start centralizing tick related CPU hotplug operationsFrederic Weisbecker2-9/+16
During the CPU offlining process, the various timer tick features are shut down from scattered places, sometimes from teardown callbacks on stop machine, sometimes through explicit calls, sometimes from the control CPU after the CPU died. The reason why these shutdown operations are spread around is not always clear and it makes the tick lifecycle hard to follow. The tick should be shut down in order from highest to lowest level: On stop machine from the dying CPU (high-level): 1) Hand-over the timekeeping duty (tick_handover_do_timer()) 2) Cancel the tick implementation called by the clockevent callback (tick_cancel_sched_timer()) 3) Shutdown broadcasting (tick_offline_cpu() / tick_broadcast_offline()) On stop machine from the dying CPU (low-level): 4) Shutdown clockevents drivers (CPUHP_AP_*_TIMER_STARTING states) From the control CPU after the CPU died (low-level): 5) Shutdown/unregister/cleanup clockevents for the dead CPU (tick_cleanup_dead_cpu()) Instead the current order is 2, 4 (both from CPU hotplug states), then 1 and 3 through direct calls. This layout and order don't make much sense. The operations 1, 2, 3 should be gathered together and in order. Sort this situation with creating a new TICK shut-down CPU hotplug state and start with introducing the timekeeping duty hand-over there. The state must precede hrtimers migration because the tick hrtimer will be stopped from it in a further patch. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-26tick/sched: Don't clear ts::next_tick again in can_stop_idle_tick()Frederic Weisbecker1-5/+0
The tick sched structure is already cleared from tick_cancel_sched_timer(), so there is no need to clear that field again. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-26tick/sched: Rename tick_nohz_stop_sched_tick() to tick_nohz_full_stop_tick()Frederic Weisbecker1-2/+2
tick_nohz_stop_sched_tick() is only about NOHZ_full and not about dynticks-idle. Reflect that in the function name to avoid confusion. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-26tick: Use IS_ENABLED() whenever possibleFrederic Weisbecker2-12/+6
Avoid ifdeferry if it can be converted to IS_ENABLED() whenever possible Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-26tick/sched: Remove useless oneshot ifdefferyFrederic Weisbecker1-5/+1
tick-sched.c is only built when CONFIG_TICK_ONESHOT=y, which is selected only if CONFIG_NO_HZ_COMMON=y or CONFIG_HIGH_RES_TIMERS=y. Therefore the related ifdeferry in this file is needless and can be removed. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-26tick/nohz: Remove duplicate between lowres and highres handlersPeng Liu1-60/+36
tick_nohz_lowres_handler() does the same work as tick_nohz_highres_handler() plus the clockevent device reprogramming, so make the former reuse the latter and rename it accordingly. Signed-off-by: Peng Liu <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-26tick/nohz: Remove duplicate between tick_nohz_switch_to_nohz() and ↵Peng Liu3-23/+20
tick_setup_sched_timer() The ts->sched_timer initialization work of tick_nohz_switch_to_nohz() is almost the same as that of tick_setup_sched_timer(), so adjust the latter to get it reused by tick_nohz_switch_to_nohz(). This also makes the low resolution mode sched_timer benefit from the tick skew boot option. Signed-off-by: Peng Liu <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-25rcu-tasks: Maintain real-time response in rcu_tasks_postscan()Paul E. McKenney1-1/+21
The current code will scan the entirety of each per-CPU list of exiting tasks in ->rtp_exit_list with interrupts disabled. This is normally just fine, because each CPU typically won't have very many tasks in this state. However, if a large number of tasks block late in do_exit(), these lists could be arbitrarily long. Low probability, perhaps, but it really could happen. This commit therefore occasionally re-enables interrupts while traversing these lists, inserting a dummy element to hold the current place in the list. In kernels built with CONFIG_PREEMPT_RT=y, this re-enabling happens after each list element is processed, otherwise every one-to-two jiffies. [ paulmck: Apply Frederic Weisbecker feedback. ] Link: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Paul E. McKenney <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Anna-Maria Behnsen <[email protected]> Cc: Steven Rostedt <[email protected]> Signed-off-by: Boqun Feng <[email protected]>
2024-02-25rcu-tasks: Eliminate deadlocks involving do_exit() and RCU tasksPaul E. McKenney1-16/+28
Holding a mutex across synchronize_rcu_tasks() and acquiring that same mutex in code called from do_exit() after its call to exit_tasks_rcu_start() but before its call to exit_tasks_rcu_stop() results in deadlock. This is by design, because tasks that are far enough into do_exit() are no longer present on the tasks list, making it a bit difficult for RCU Tasks to find them, let alone wait on them to do a voluntary context switch. However, such deadlocks are becoming more frequent. In addition, lockdep currently does not detect such deadlocks and they can be difficult to reproduce. In addition, if a task voluntarily context switches during that time (for example, if it blocks acquiring a mutex), then this task is in an RCU Tasks quiescent state. And with some adjustments, RCU Tasks could just as well take advantage of that fact. This commit therefore eliminates these deadlock by replacing the SRCU-based wait for do_exit() completion with per-CPU lists of tasks currently exiting. A given task will be on one of these per-CPU lists for the same period of time that this task would previously have been in the previous SRCU read-side critical section. These lists enable RCU Tasks to find the tasks that have already been removed from the tasks list, but that must nevertheless be waited upon. The RCU Tasks grace period gathers any of these do_exit() tasks that it must wait on, and adds them to the list of holdouts. Per-CPU locking and get_task_struct() are used to synchronize addition to and removal from these lists. Link: https://lore.kernel.org/all/[email protected]/ Reported-by: Chen Zhongjin <[email protected]> Reported-by: Yang Jihong <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Tested-by: Yang Jihong <[email protected]> Tested-by: Chen Zhongjin <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Signed-off-by: Boqun Feng <[email protected]>
2024-02-25rcu-tasks: Maintain lists to eliminate RCU-tasks/do_exit() deadlocksPaul E. McKenney1-10/+33
This commit continues the elimination of deadlocks involving do_exit() and RCU tasks by causing exit_tasks_rcu_start() to add the current task to a per-CPU list and causing exit_tasks_rcu_stop() to remove the current task from whatever list it is on. These lists will be used to track tasks that are exiting, while still accounting for any RCU-tasks quiescent states that these tasks pass though. [ paulmck: Apply Frederic Weisbecker feedback. ] Link: https://lore.kernel.org/all/[email protected]/ Reported-by: Chen Zhongjin <[email protected]> Reported-by: Yang Jihong <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Tested-by: Yang Jihong <[email protected]> Tested-by: Chen Zhongjin <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Signed-off-by: Boqun Feng <[email protected]>
2024-02-25rcu-tasks: Initialize data to eliminate RCU-tasks/do_exit() deadlocksPaul E. McKenney2-0/+3
Holding a mutex across synchronize_rcu_tasks() and acquiring that same mutex in code called from do_exit() after its call to exit_tasks_rcu_start() but before its call to exit_tasks_rcu_stop() results in deadlock. This is by design, because tasks that are far enough into do_exit() are no longer present on the tasks list, making it a bit difficult for RCU Tasks to find them, let alone wait on them to do a voluntary context switch. However, such deadlocks are becoming more frequent. In addition, lockdep currently does not detect such deadlocks and they can be difficult to reproduce. In addition, if a task voluntarily context switches during that time (for example, if it blocks acquiring a mutex), then this task is in an RCU Tasks quiescent state. And with some adjustments, RCU Tasks could just as well take advantage of that fact. This commit therefore initializes the data structures that will be needed to rely on these quiescent states and to eliminate these deadlocks. Link: https://lore.kernel.org/all/[email protected]/ Reported-by: Chen Zhongjin <[email protected]> Reported-by: Yang Jihong <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Tested-by: Yang Jihong <[email protected]> Tested-by: Chen Zhongjin <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Signed-off-by: Boqun Feng <[email protected]>
2024-02-25rcu-tasks: Initialize callback lists at rcu_init() timePaul E. McKenney4-6/+27
In order for RCU Tasks to reliably maintain per-CPU lists of exiting tasks, those lists must be initialized before it is possible for tasks to exit, especially given that the boot CPU is not necessarily CPU 0 (an example being, powerpc kexec() kernels). And at the time that rcu_init_tasks_generic() is called, a task could potentially exit, unconventional though that sort of thing might be. This commit therefore moves the calls to cblist_init_generic() from functions called from rcu_init_tasks_generic() to a new function named tasks_cblist_init_generic() that is invoked from rcu_init(). This constituted a bug in a commit that never went to mainline, so there is no need for any backporting to -stable. Reported-by: Frederic Weisbecker <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Boqun Feng <[email protected]>
2024-02-25rcu-tasks: Add data to eliminate RCU-tasks/do_exit() deadlocksPaul E. McKenney1-0/+2
Holding a mutex across synchronize_rcu_tasks() and acquiring that same mutex in code called from do_exit() after its call to exit_tasks_rcu_start() but before its call to exit_tasks_rcu_stop() results in deadlock. This is by design, because tasks that are far enough into do_exit() are no longer present on the tasks list, making it a bit difficult for RCU Tasks to find them, let alone wait on them to do a voluntary context switch. However, such deadlocks are becoming more frequent. In addition, lockdep currently does not detect such deadlocks and they can be difficult to reproduce. In addition, if a task voluntarily context switches during that time (for example, if it blocks acquiring a mutex), then this task is in an RCU Tasks quiescent state. And with some adjustments, RCU Tasks could just as well take advantage of that fact. This commit therefore adds the data structures that will be needed to rely on these quiescent states and to eliminate these deadlocks. Link: https://lore.kernel.org/all/[email protected]/ Reported-by: Chen Zhongjin <[email protected]> Reported-by: Yang Jihong <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Tested-by: Yang Jihong <[email protected]> Tested-by: Chen Zhongjin <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Signed-off-by: Boqun Feng <[email protected]>
2024-02-25power: port block device access to fileChristian Brauner1-14/+14
Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2024-02-24sched: Add a new function to compare if two cpus have the same capacityQais Yousef1-0/+11
The new helper function is needed to help blk-mq check if it needs to dispatch the softirq on another CPU to match the performance level the IO requester is running at. This is important on HMP systems where not all CPUs have the same compute capacity. Signed-off-by: Qais Yousef <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-02-23arm64, crash: wrap crash dumping code into crash related ifdefsBaoquan He1-0/+2
Now crash codes under kernel/ folder has been split out from kexec code, crash dumping can be separated from kexec reboot in config items on arm64 with some adjustments. Here wrap up crash dumping codes with CONFIG_CRASH_DUMP ifdeffery. [[email protected]: fix building error in generic codes] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Baoquan He <[email protected]> Cc: Al Viro <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Hari Bathini <[email protected]> Cc: Pingfan Liu <[email protected]> Cc: Klara Modin <[email protected]> Cc: Michael Kelley <[email protected]> Cc: Nathan Chancellor <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Yang Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23crash: clean up kdump related config itemsBaoquan He1-3/+4
By splitting CRASH_RESERVE and VMCORE_INFO out from CRASH_CORE, cleaning up the dependency of FA_DMUMP on CRASH_DUMP, and moving crash codes from kexec_core.c to crash_core.c, now we can rearrange CRASH_DUMP to depend on KEXEC_CORE, and make CRASH_DUMP select CRASH_RESERVE and VMCORE_INFO. KEXEC_CORE won't select CRASH_RESERVE and VMCORE_INFO any more because KEXEC_CORE enables codes which allocate control pages, copy kexec/kdump segments, and prepare for switching. These codes are shared by both kexec reboot and crash dumping. Doing this makes codes and the corresponding config items more logical (the right item depends on or is selected by the left item). PROC_KCORE -----------> VMCORE_INFO |----------> VMCORE_INFO FA_DUMP----| |----------> CRASH_RESERVE ---->VMCORE_INFO / |---->CRASH_RESERVE KEXEC --| /| |--> KEXEC_CORE--> CRASH_DUMP-->/-|---->PROC_VMCORE KEXEC_FILE --| \ | \---->CRASH_HOTPLUG KEXEC --| |--> KEXEC_CORE--> kexec reboot KEXEC_FILE --| Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Baoquan He <[email protected]> Cc: Al Viro <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Hari Bathini <[email protected]> Cc: Pingfan Liu <[email protected]> Cc: Klara Modin <[email protected]> Cc: Michael Kelley <[email protected]> Cc: Nathan Chancellor <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Yang Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23crash: split crash dumping code out from kexec_core.cBaoquan He6-244/+293
Currently, KEXEC_CORE select CRASH_CORE automatically because crash codes need be built in to avoid compiling error when building kexec code even though the crash dumping functionality is not enabled. E.g -------------------- CONFIG_CRASH_CORE=y CONFIG_KEXEC_CORE=y CONFIG_KEXEC=y CONFIG_KEXEC_FILE=y --------------------- After splitting out crashkernel reservation code and vmcoreinfo exporting code, there's only crash related code left in kernel/crash_core.c. Now move crash related codes from kexec_core.c to crash_core.c and only build it in when CONFIG_CRASH_DUMP=y. And also wrap up crash codes inside CONFIG_CRASH_DUMP ifdeffery scope, or replace inappropriate CONFIG_KEXEC_CORE ifdef with CONFIG_CRASH_DUMP ifdef in generic kernel files. With these changes, crash_core codes are abstracted from kexec codes and can be disabled at all if only kexec reboot feature is wanted. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Baoquan He <[email protected]> Cc: Al Viro <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Hari Bathini <[email protected]> Cc: Pingfan Liu <[email protected]> Cc: Klara Modin <[email protected]> Cc: Michael Kelley <[email protected]> Cc: Nathan Chancellor <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Yang Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23crash: remove dependency of FA_DUMP on CRASH_DUMPBaoquan He3-2/+3
In kdump kernel, /proc/vmcore is an elf file mapping the crashed kernel's old memory content. Its elf header is constructed in 1st kernel and passed to kdump kernel via elfcorehdr_addr. Config CRASH_DUMP enables the code of 1st kernel's old memory accessing in different architectures. Currently, config FA_DUMP has dependency on CRASH_DUMP because fadump needs access global variable 'elfcorehdr_addr' to judge if it's in kdump kernel within function is_kdump_kernel(). In the current kernel/crash_dump.c, variable 'elfcorehdr_addr' is defined, and function setup_elfcorehdr() used to parse kernel parameter to fetch the passed value of elfcorehdr_addr. Only for accessing elfcorehdr_addr, FA_DUMP really doesn't have to depends on CRASH_DUMP. To remove the dependency of FA_DUMP on CRASH_DUMP to avoid confusion, rename kernel/crash_dump.c to kernel/elfcorehdr.c, and build it when CONFIG_VMCORE_INFO is ebabled. With this, FA_DUMP doesn't need to depend on CRASH_DUMP. [[email protected]: power/fadump: make FA_DUMP select CRASH_DUMP] Link: https://lkml.kernel.org/r/Zb8D1ASrgX0qVm9z@MiWiFi-R3L-srv Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Baoquan He <[email protected]> Acked-by: Hari Bathini <[email protected]> Cc: Al Viro <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Pingfan Liu <[email protected]> Cc: Klara Modin <[email protected]> Cc: Michael Kelley <[email protected]> Cc: Nathan Chancellor <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Yang Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23crash: split vmcoreinfo exporting code out from crash_core.cBaoquan He6-215/+239
Now move the relevant codes into separate files: kernel/crash_reserve.c, include/linux/crash_reserve.h. And add config item CRASH_RESERVE to control its enabling. And also update the old ifdeffery of CONFIG_CRASH_CORE, including of <linux/crash_core.h> and config item dependency on CRASH_CORE accordingly. And also do renaming as follows: - arch/xxx/kernel/{crash_core.c => vmcore_info.c} because they are only related to vmcoreinfo exporting on x86, arm64, riscv. And also Remove config item CRASH_CORE, and rely on CONFIG_KEXEC_CORE to decide if build in crash_core.c. [[email protected]: remove duplicated include in vmcore_info.c] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Baoquan He <[email protected]> Signed-off-by: Yang Li <[email protected]> Acked-by: Hari Bathini <[email protected]> Cc: Al Viro <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Pingfan Liu <[email protected]> Cc: Klara Modin <[email protected]> Cc: Michael Kelley <[email protected]> Cc: Nathan Chancellor <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Yang Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23kexec: split crashkernel reservation code out from crash_core.cBaoquan He4-439/+469
Patch series "Split crash out from kexec and clean up related config items", v3. Motivation: ============= Previously, LKP reported a building error. When investigating, it can't be resolved reasonablly with the present messy kdump config items. https://lore.kernel.org/oe-kbuild-all/[email protected]/ The kdump (crash dumping) related config items could causes confusions: Firstly, CRASH_CORE enables codes including - crashkernel reservation; - elfcorehdr updating; - vmcoreinfo exporting; - crash hotplug handling; Now fadump of powerpc, kcore dynamic debugging and kdump all selects CRASH_CORE, while fadump - fadump needs crashkernel parsing, vmcoreinfo exporting, and accessing global variable 'elfcorehdr_addr'; - kcore only needs vmcoreinfo exporting; - kdump needs all of the current kernel/crash_core.c. So only enabling PROC_CORE or FA_DUMP will enable CRASH_CORE, this mislead people that we enable crash dumping, actual it's not. Secondly, It's not reasonable to allow KEXEC_CORE select CRASH_CORE. Because KEXEC_CORE enables codes which allocate control pages, copy kexec/kdump segments, and prepare for switching. These codes are shared by both kexec reboot and kdump. We could want kexec reboot, but disable kdump. In that case, CRASH_CORE should not be selected. -------------------- CONFIG_CRASH_CORE=y CONFIG_KEXEC_CORE=y CONFIG_KEXEC=y CONFIG_KEXEC_FILE=y --------------------- Thirdly, It's not reasonable to allow CRASH_DUMP select KEXEC_CORE. That could make KEXEC_CORE, CRASH_DUMP are enabled independently from KEXEC or KEXEC_FILE. However, w/o KEXEC or KEXEC_FILE, the KEXEC_CORE code built in doesn't make any sense because no kernel loading or switching will happen to utilize the KEXEC_CORE code. --------------------- CONFIG_CRASH_CORE=y CONFIG_KEXEC_CORE=y CONFIG_CRASH_DUMP=y --------------------- In this case, what is worse, on arch sh and arm, KEXEC relies on MMU, while CRASH_DUMP can still be enabled when !MMU, then compiling error is seen as the lkp test robot reported in above link. ------arch/sh/Kconfig------ config ARCH_SUPPORTS_KEXEC def_bool MMU config ARCH_SUPPORTS_CRASH_DUMP def_bool BROKEN_ON_SMP --------------------------- Changes: =========== 1, split out crash_reserve.c from crash_core.c; 2, split out vmcore_infoc. from crash_core.c; 3, move crash related codes in kexec_core.c into crash_core.c; 4, remove dependency of FA_DUMP on CRASH_DUMP; 5, clean up kdump related config items; 6, wrap up crash codes in crash related ifdefs on all 8 arch-es which support crash dumping, except of ppc; Achievement: =========== With above changes, I can rearrange the config item logic as below (the right item depends on or is selected by the left item): PROC_KCORE -----------> VMCORE_INFO |----------> VMCORE_INFO FA_DUMP----| |----------> CRASH_RESERVE ---->VMCORE_INFO / |---->CRASH_RESERVE KEXEC --| /| |--> KEXEC_CORE--> CRASH_DUMP-->/-|---->PROC_VMCORE KEXEC_FILE --| \ | \---->CRASH_HOTPLUG KEXEC --| |--> KEXEC_CORE (for kexec reboot only) KEXEC_FILE --| Test ======== On all 8 architectures, including x86_64, arm64, s390x, sh, arm, mips, riscv, loongarch, I did below three cases of config item setting and building all passed. Take configs on x86_64 as exampmle here: (1) Both CONFIG_KEXEC and KEXEC_FILE is unset, then all kexec/kdump items are unset automatically: # Kexec and crash features # CONFIG_KEXEC is not set # CONFIG_KEXEC_FILE is not set # end of Kexec and crash features (2) set CONFIG_KEXEC_FILE and 'make olddefconfig': --------------- # Kexec and crash features CONFIG_CRASH_RESERVE=y CONFIG_VMCORE_INFO=y CONFIG_KEXEC_CORE=y CONFIG_KEXEC_FILE=y CONFIG_CRASH_DUMP=y CONFIG_CRASH_HOTPLUG=y CONFIG_CRASH_MAX_MEMORY_RANGES=8192 # end of Kexec and crash features --------------- (3) unset CONFIG_CRASH_DUMP in case 2 and execute 'make olddefconfig': ------------------------ # Kexec and crash features CONFIG_KEXEC_CORE=y CONFIG_KEXEC_FILE=y # end of Kexec and crash features ------------------------ Note: For ppc, it needs investigation to make clear how to split out crash code in arch folder. Hope Hari and Pingfan can help have a look, see if it's doable. Now, I make it either have both kexec and crash enabled, or disable both of them altogether. This patch (of 14): Both kdump and fa_dump of ppc rely on crashkernel reservation. Move the relevant codes into separate files: crash_reserve.c, include/linux/crash_reserve.h. And also add config item CRASH_RESERVE to control its enabling of the codes. And update config items which has relationship with crashkernel reservation. And also change ifdeffery from CONFIG_CRASH_CORE to CONFIG_CRASH_RESERVE when those scopes are only crashkernel reservation related. And also rename arch/XXX/include/asm/{crash_core.h => crash_reserve.h} on arm64, x86 and risc-v because those architectures' crash_core.h is only related to crashkernel reservation. [[email protected]: s/CRASH_RESEERVE/CRASH_RESERVE/, per Klara Modin] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Baoquan He <[email protected]> Acked-by: Hari Bathini <[email protected]> Cc: Al Viro <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Pingfan Liu <[email protected]> Cc: Klara Modin <[email protected]> Cc: Michael Kelley <[email protected]> Cc: Nathan Chancellor <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Yang Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23mm/vmalloc: remove vmap_area_listBaoquan He2-4/+1
Earlier, vmap_area_list is exported to vmcoreinfo so that makedumpfile get the base address of vmalloc area. Now, vmap_area_list is empty, so export VMALLOC_START to vmcoreinfo instead, and remove vmap_area_list. [[email protected]: fix a warning in the crash_save_vmcoreinfo_init()] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Baoquan He <[email protected]> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Acked-by: Lorenzo Stoakes <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Joel Fernandes (Google) <[email protected]> Cc: Kazuhito Hagio <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Cc: Paul E. McKenney <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23cred: Use KMEM_CACHE() instead of kmem_cache_create()Kunwu Chan1-2/+2
Commit 0a31bd5f2bbb ("KMEM_CACHE(): simplify slab cache creation") introduces a new macro. Use the new KMEM_CACHE() macro instead of direct kmem_cache_create() to simplify the creation of SLAB caches. Signed-off-by: Kunwu Chan <[email protected]> [PM: alignment fixes in both code and description] Signed-off-by: Paul Moore <[email protected]>
2024-02-23genirq/matrix: Dynamic bitmap allocationBjörn Töpel1-11/+17
A future user of the matrix allocator, does not know the size of the matrix bitmaps at compile time. To avoid wasting memory on unnecessary large bitmaps, size the bitmap at matrix allocation time. Signed-off-by: Björn Töpel <[email protected]> Signed-off-by: Anup Patel <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-22bpf: add is_async_callback_calling_insn() helperBenjamin Tissoires1-4/+7
Currently we have a special case for BPF_FUNC_timer_set_callback, let's introduce a helper we can extend for the kfunc that will come in a later patch Signed-off-by: Benjamin Tissoires <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-02-22bpf: introduce in_sleepable() helperBenjamin Tissoires1-6/+11
No code change, but it'll allow to have only one place to change everything when we add in_sleepable in cur_state. Signed-off-by: Benjamin Tissoires <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-02-22bpf: allow more maps in sleepable bpf programsBenjamin Tissoires1-0/+2
These 2 maps types are required for HID-BPF when a user wants to do IO with a device from a sleepable tracing point. Allowing BPF_MAP_TYPE_QUEUE (and therefore BPF_MAP_TYPE_STACK) allows for a BPF program to prepare from an IRQ the list of HID commands to send back to the device and then these commands can be retrieved from the sleepable trace point. Signed-off-by: Benjamin Tissoires <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-02-22panic: add option to dump blocked tasks in panic_printFeng Tang1-0/+4
For debugging kernel panics and other bugs, there is already an option of panic_print to dump all tasks' call stacks. On today's large servers running many containers, there could be thousands of tasks or more, and this will print out huge amount of call stacks, taking a lot of time (for serial console which is main target user case of panic_print). And in many cases, only those several tasks being blocked are key for the panic, so add an option to only dump blocked tasks' call stacks. [[email protected]: clarify documentation a little] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Feng Tang <[email protected]> Tested-by: Guilherme G. Piccoli <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Randy Dunlap <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22ptrace_attach: shift send(SIGSTOP) into ptrace_set_stopped()Oleg Nesterov1-8/+5
Turn send_sig_info(SIGSTOP) into send_signal_locked(SIGSTOP) and move it from ptrace_attach() to ptrace_set_stopped(). This looks more logical and avoids lock(siglock) right after unlock(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oleg Nesterov <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22user_namespace: remove unnecessary NULL values from kbufLi zeming1-1/+1
kbuf is assigned first, so it does not need to initialize the assignment. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Li zeming <[email protected]> Cc: Randy Dunlap <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22panic: suppress gnu_printf warningBaoquan He1-0/+5
with GCC 13.2.1 and W=1, there's compiling warning like this: kernel/panic.c: In function `__warn': kernel/panic.c:676:17: warning: function `__warn' might be a candidate for `gnu_printf' format attribute [-Wsuggest-attribute=format] 676 | vprintk(args->fmt, args->args); | ^~~~~~~ The normal __printf(x,y) adding can't fix it. So add workaround which disables -Wsuggest-attribute=format to mute it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Baoquan He <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22bounds: support non-power-of-two CONFIG_NR_CPUSMatthew Wilcox (Oracle)1-1/+1
ilog2() rounds down, so for example when PowerPC 85xx sets CONFIG_NR_CPUS to 24, we will only allocate 4 bits to store the number of CPUs instead of 5. Use bits_per() instead, which rounds up. Found by code inspection. The effect of this would probably be a misaccounting when doing NUMA balancing, so to a user, it would only be a performance penalty. The effects may be more wide-spread; it's hard to tell. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Fixes: 90572890d202 ("mm: numa: Change page last {nid,pid} into {cpu,pid}") Reviewed-by: Rik van Riel <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski10-12/+29
Cross-merge networking fixes after downstream PR. Conflicts: net/ipv4/udp.c f796feabb9f5 ("udp: add local "peek offset enabled" flag") 56667da7399e ("net: implement lockless setsockopt(SO_PEEK_OFF)") Adjacent changes: net/unix/garbage.c aa82ac51d633 ("af_unix: Drop oob_skb ref before purging queue in GC.") 11498715f266 ("af_unix: Remove io_uring code for GC.") Signed-off-by: Jakub Kicinski <[email protected]>
2024-02-22hrtimer: Select housekeeping CPU during migrationCosta Shulyupin1-1/+2
During CPU-down hotplug, hrtimers may migrate to isolated CPUs, compromising CPU isolation. Address this issue by masking valid CPUs for hrtimers using housekeeping_cpumask(HK_TYPE_TIMER). Suggested-by: Waiman Long <[email protected]> Signed-off-by: Costa Shulyupin <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Waiman Long <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-22bpf: Check cfi_stubs before registering a struct_ops type.Kui-Feng Lee1-0/+5
Recently, st_ops->cfi_stubs was introduced. However, the upcoming new struct_ops support (e.g. sched_ext) is not aware of this and does not provide its own cfi_stubs. The kernel ends up NULL dereferencing the st_ops->cfi_stubs. Considering struct_ops supports kernel module now, this NULL check is necessary. This patch is to reject struct_ops registration that does not provide a cfi_stubs. Signed-off-by: Kui-Feng Lee <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
2024-02-22PM: EM: Fix nr_states warnings in static checksLukasz Luba1-2/+1
During the static checks nr_states has been mentioned by the kernel test robot. Fix the warning in those 2 places. Reported-by: kernel test robot <[email protected]> Signed-off-by: Lukasz Luba <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2024-02-22PM: hibernate: Don't ignore return from set_memory_ro()Christophe Leroy4-15/+24
set_memory_ro() and set_memory_rw() can fail, leaving memory unprotected. Take the returned value into account and abort in case of failure. Signed-off-by: Christophe Leroy <[email protected]> Reviewed-by: Kees Cook <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2024-02-22PM: hibernate: Support to select compression algorithmNikhil V1-3/+54
Currently the default compression algorithm is selected based on compile time options. Introduce a module parameter "hibernate.compressor" to override this behaviour. Different compression algorithms have different characteristics and hibernation may benefit when it uses any of these algorithms, especially when a secondary algorithm(LZ4) offers better decompression speeds over a default algorithm(LZO), which in turn reduces hibernation image restore time. Users can override the default algorithm in two ways: 1) Passing "hibernate.compressor" as kernel command line parameter. Usage: LZO: hibernate.compressor=lzo LZ4: hibernate.compressor=lz4 2) Specifying the algorithm at runtime. Usage: LZO: echo lzo > /sys/module/hibernate/parameters/compressor LZ4: echo lz4 > /sys/module/hibernate/parameters/compressor Currently LZO and LZ4 are the supported algorithms. LZO is the default compression algorithm used with hibernation. Signed-off-by: Nikhil V <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>