aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2024-02-22bpf: add is_async_callback_calling_insn() helperBenjamin Tissoires1-4/+7
Currently we have a special case for BPF_FUNC_timer_set_callback, let's introduce a helper we can extend for the kfunc that will come in a later patch Signed-off-by: Benjamin Tissoires <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-02-22bpf: introduce in_sleepable() helperBenjamin Tissoires1-6/+11
No code change, but it'll allow to have only one place to change everything when we add in_sleepable in cur_state. Signed-off-by: Benjamin Tissoires <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-02-22bpf: allow more maps in sleepable bpf programsBenjamin Tissoires1-0/+2
These 2 maps types are required for HID-BPF when a user wants to do IO with a device from a sleepable tracing point. Allowing BPF_MAP_TYPE_QUEUE (and therefore BPF_MAP_TYPE_STACK) allows for a BPF program to prepare from an IRQ the list of HID commands to send back to the device and then these commands can be retrieved from the sleepable trace point. Signed-off-by: Benjamin Tissoires <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-02-22panic: add option to dump blocked tasks in panic_printFeng Tang1-0/+4
For debugging kernel panics and other bugs, there is already an option of panic_print to dump all tasks' call stacks. On today's large servers running many containers, there could be thousands of tasks or more, and this will print out huge amount of call stacks, taking a lot of time (for serial console which is main target user case of panic_print). And in many cases, only those several tasks being blocked are key for the panic, so add an option to only dump blocked tasks' call stacks. [[email protected]: clarify documentation a little] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Feng Tang <[email protected]> Tested-by: Guilherme G. Piccoli <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Randy Dunlap <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22ptrace_attach: shift send(SIGSTOP) into ptrace_set_stopped()Oleg Nesterov1-8/+5
Turn send_sig_info(SIGSTOP) into send_signal_locked(SIGSTOP) and move it from ptrace_attach() to ptrace_set_stopped(). This looks more logical and avoids lock(siglock) right after unlock(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oleg Nesterov <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22user_namespace: remove unnecessary NULL values from kbufLi zeming1-1/+1
kbuf is assigned first, so it does not need to initialize the assignment. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Li zeming <[email protected]> Cc: Randy Dunlap <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22panic: suppress gnu_printf warningBaoquan He1-0/+5
with GCC 13.2.1 and W=1, there's compiling warning like this: kernel/panic.c: In function `__warn': kernel/panic.c:676:17: warning: function `__warn' might be a candidate for `gnu_printf' format attribute [-Wsuggest-attribute=format] 676 | vprintk(args->fmt, args->args); | ^~~~~~~ The normal __printf(x,y) adding can't fix it. So add workaround which disables -Wsuggest-attribute=format to mute it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Baoquan He <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22bounds: support non-power-of-two CONFIG_NR_CPUSMatthew Wilcox (Oracle)1-1/+1
ilog2() rounds down, so for example when PowerPC 85xx sets CONFIG_NR_CPUS to 24, we will only allocate 4 bits to store the number of CPUs instead of 5. Use bits_per() instead, which rounds up. Found by code inspection. The effect of this would probably be a misaccounting when doing NUMA balancing, so to a user, it would only be a performance penalty. The effects may be more wide-spread; it's hard to tell. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Fixes: 90572890d202 ("mm: numa: Change page last {nid,pid} into {cpu,pid}") Reviewed-by: Rik van Riel <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski10-12/+29
Cross-merge networking fixes after downstream PR. Conflicts: net/ipv4/udp.c f796feabb9f5 ("udp: add local "peek offset enabled" flag") 56667da7399e ("net: implement lockless setsockopt(SO_PEEK_OFF)") Adjacent changes: net/unix/garbage.c aa82ac51d633 ("af_unix: Drop oob_skb ref before purging queue in GC.") 11498715f266 ("af_unix: Remove io_uring code for GC.") Signed-off-by: Jakub Kicinski <[email protected]>
2024-02-22hrtimer: Select housekeeping CPU during migrationCosta Shulyupin1-1/+2
During CPU-down hotplug, hrtimers may migrate to isolated CPUs, compromising CPU isolation. Address this issue by masking valid CPUs for hrtimers using housekeeping_cpumask(HK_TYPE_TIMER). Suggested-by: Waiman Long <[email protected]> Signed-off-by: Costa Shulyupin <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Waiman Long <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-22bpf: Check cfi_stubs before registering a struct_ops type.Kui-Feng Lee1-0/+5
Recently, st_ops->cfi_stubs was introduced. However, the upcoming new struct_ops support (e.g. sched_ext) is not aware of this and does not provide its own cfi_stubs. The kernel ends up NULL dereferencing the st_ops->cfi_stubs. Considering struct_ops supports kernel module now, this NULL check is necessary. This patch is to reject struct_ops registration that does not provide a cfi_stubs. Signed-off-by: Kui-Feng Lee <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
2024-02-22PM: EM: Fix nr_states warnings in static checksLukasz Luba1-2/+1
During the static checks nr_states has been mentioned by the kernel test robot. Fix the warning in those 2 places. Reported-by: kernel test robot <[email protected]> Signed-off-by: Lukasz Luba <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2024-02-22PM: hibernate: Don't ignore return from set_memory_ro()Christophe Leroy4-15/+24
set_memory_ro() and set_memory_rw() can fail, leaving memory unprotected. Take the returned value into account and abort in case of failure. Signed-off-by: Christophe Leroy <[email protected]> Reviewed-by: Kees Cook <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2024-02-22PM: hibernate: Support to select compression algorithmNikhil V1-3/+54
Currently the default compression algorithm is selected based on compile time options. Introduce a module parameter "hibernate.compressor" to override this behaviour. Different compression algorithms have different characteristics and hibernation may benefit when it uses any of these algorithms, especially when a secondary algorithm(LZ4) offers better decompression speeds over a default algorithm(LZO), which in turn reduces hibernation image restore time. Users can override the default algorithm in two ways: 1) Passing "hibernate.compressor" as kernel command line parameter. Usage: LZO: hibernate.compressor=lzo LZ4: hibernate.compressor=lz4 2) Specifying the algorithm at runtime. Usage: LZO: echo lzo > /sys/module/hibernate/parameters/compressor LZ4: echo lz4 > /sys/module/hibernate/parameters/compressor Currently LZO and LZ4 are the supported algorithms. LZO is the default compression algorithm used with hibernation. Signed-off-by: Nikhil V <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2024-02-22mm/cma: drop CONFIG_CMA_DEBUGAnshuman Khandual1-6/+0
All pr_debug() prints in (mm/cma.c) could be enabled via standard Makefile based method. Besides cma_debug_show_areas() should always be called during cma_alloc() failure path. This seemingly redundant config, CONFIG_CMA_DEBUG can be dropped without any problem. [[email protected]: remove debug code to removed CONFIG_CMA_DEBUG] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Anshuman Khandual <[email protected]> Signed-off-by: Lukas Bulwahn <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: Robin Murphy <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22Merge tag 'net-6.8.0-rc6' of ↵Linus Torvalds3-1/+8
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from bpf and netfilter. Current release - regressions: - af_unix: fix another unix GC hangup Previous releases - regressions: - core: fix a possible AF_UNIX deadlock - bpf: fix NULL pointer dereference in sk_psock_verdict_data_ready() - netfilter: nft_flow_offload: release dst in case direct xmit path is used - bridge: switchdev: ensure MDB events are delivered exactly once - l2tp: pass correct message length to ip6_append_data - dccp/tcp: unhash sk from ehash for tb2 alloc failure after check_estalblished() - tls: fixes for record type handling with PEEK - devlink: fix possible use-after-free and memory leaks in devlink_init() Previous releases - always broken: - bpf: fix an oops when attempting to read the vsyscall page through bpf_probe_read_kernel - sched: act_mirred: use the backlog for mirred ingress - netfilter: nft_flow_offload: fix dst refcount underflow - ipv6: sr: fix possible use-after-free and null-ptr-deref - mptcp: fix several data races - phonet: take correct lock to peek at the RX queue Misc: - handful of fixes and reliability improvements for selftests" * tag 'net-6.8.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (72 commits) l2tp: pass correct message length to ip6_append_data net: phy: realtek: Fix rtl8211f_config_init() for RTL8211F(D)(I)-VD-CG PHY selftests: ioam: refactoring to align with the fix Fix write to cloned skb in ipv6_hop_ioam() phonet/pep: fix racy skb_queue_empty() use phonet: take correct lock to peek at the RX queue net: sparx5: Add spinlock for frame transmission from CPU net/sched: flower: Add lock protection when remove filter handle devlink: fix port dump cmd type net: stmmac: Fix EST offset for dwmac 5.10 tools: ynl: don't leak mcast_groups on init error tools: ynl: make sure we always pass yarg to mnl_cb_run net: mctp: put sock on tag allocation failure netfilter: nf_tables: use kzalloc for hook allocation netfilter: nf_tables: register hooks last when adding new chain/flowtable netfilter: nft_flow_offload: release dst in case direct xmit path is used netfilter: nft_flow_offload: reset dst in route object after setting up flow netfilter: nf_tables: set dormant flag on hook register failure selftests: tls: add test for peeking past a record of a different type selftests: tls: add test for merging of same-type control messages ...
2024-02-22workqueue: Control intensive warning threshold through cmdlineXuewen Yan1-3/+11
When CONFIG_WQ_CPU_INTENSIVE_REPORT is set, the kernel will report the work functions which violate the intensive_threshold_us repeatedly. And now, only when the violate times exceed 4 and is a power of 2, the kernel warning could be triggered. However, sometimes, even if a long work execution time occurs only once, it may cause other work to be delayed for a long time. This may also cause some problems sometimes. In order to freely control the threshold of warninging, a boot argument is added so that the user can control the warning threshold to be printed. At the same time, keep the exponential backoff to prevent reporting too much. By default, the warning threshold is 4. tj: Updated kernel-parameters.txt description. Signed-off-by: Xuewen Yan <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2024-02-22Merge tag 'trace-v6.8-rc5' of ↵Linus Torvalds1-0/+4
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fix from Steven Rostedt: - While working on the ring buffer I noticed that the counter used for knowing where the end of the data is on a sub-buffer was not a full "int" but just 20 bits. It was masked out to 0xfffff. With the new code that allows the user to change the size of the sub-buffer, it is theoretically possible to ask for a size bigger than 2^20. If that happens, unexpected results may occur as there's no code checking if the counter overflowed the 20 bits of the write mask. There are other checks to make sure events fit in the sub-buffer, but if the sub-buffer itself is too big, that is not checked. Add a check in the resize of the sub-buffer to make sure that it never goes beyond the size of the counter that holds how much data is on it. * tag 'trace-v6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ring-buffer: Do not let subbuf be bigger than write mask
2024-02-22timers: Always queue timers on the local CPUAnna-Maria Behnsen1-21/+15
The timer pull model is in place so we can remove the heuristics which try to guess the best target CPU at enqueue/modification time. All non pinned timers are queued on the local CPU in the separate storage and eventually pulled at expiry time to a remote CPU. Originally-by: Richard Cochran (linutronix GmbH) <[email protected]> Signed-off-by: Anna-Maria Behnsen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-22timer_migration: Add tracepointsAnna-Maria Behnsen1-0/+26
The timer pull logic needs proper debugging aids. Add tracepoints so the hierarchical idle machinery can be diagnosed. Signed-off-by: Anna-Maria Behnsen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-22timers: Implement the hierarchical pull modelAnna-Maria Behnsen5-8/+2010
Placing timers at enqueue time on a target CPU based on dubious heuristics does not make any sense: 1) Most timer wheel timers are canceled or rearmed before they expire. 2) The heuristics to predict which CPU will be busy when the timer expires are wrong by definition. So placing the timers at enqueue wastes precious cycles. The proper solution to this problem is to always queue the timers on the local CPU and allow the non pinned timers to be pulled onto a busy CPU at expiry time. Therefore split the timer storage into local pinned and global timers: Local pinned timers are always expired on the CPU on which they have been queued. Global timers can be expired on any CPU. As long as a CPU is busy it expires both local and global timers. When a CPU goes idle it arms for the first expiring local timer. If the first expiring pinned (local) timer is before the first expiring movable timer, then no action is required because the CPU will wake up before the first movable timer expires. If the first expiring movable timer is before the first expiring pinned (local) timer, then this timer is queued into an idle timerqueue and eventually expired by another active CPU. To avoid global locking the timerqueues are implemented as a hierarchy. The lowest level of the hierarchy holds the CPUs. The CPUs are associated to groups of 8, which are separated per node. If more than one CPU group exist, then a second level in the hierarchy collects the groups. Depending on the size of the system more than 2 levels are required. Each group has a "migrator" which checks the timerqueue during the tick for remote expirable timers. If the last CPU in a group goes idle it reports the first expiring event in the group up to the next group(s) in the hierarchy. If the last CPU goes idle it arms its timer for the first system wide expiring timer to ensure that no timer event is missed. Signed-off-by: Anna-Maria Behnsen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-22timers: Introduce function to check timer base is_idle flagAnna-Maria Behnsen2-0/+11
To prepare for the conversion of the NOHZ timer placement to a pull at expiry time model it's required to have a function that returns the value of the is_idle flag of the timer base to keep the hierarchy states during online in sync with timer base state. No functional change. Signed-off-by: Anna-Maria Behnsen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-22tick/sched: Split out jiffies update helper functionRichard Cochran (linutronix GmbH)2-7/+20
The logic to get the time of the last jiffies update will be needed by the timer pull model as well. Move the code into a global function in anticipation of the new caller. No functional change. Signed-off-by: Richard Cochran (linutronix GmbH) <[email protected]> Signed-off-by: Anna-Maria Behnsen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-22timers: Check if timers base is handled alreadyAnna-Maria Behnsen1-0/+3
Due to the conversion of the NOHZ timer placement to a pull at expiry time model, the per CPU timer bases with non pinned timers are no longer handled only by the local CPU. In case a remote CPU already expires the non pinned timers base of the local CPU, nothing more needs to be done by the local CPU. A check at the begin of the expire timers routine is required, because timer base lock is dropped before executing the timer callback function. This is a preparatory work, but has no functional impact right now. Signed-off-by: Anna-Maria Behnsen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-22timers: Restructure internal lockingRichard Cochran (linutronix GmbH)1-10/+21
Move the locking out from __run_timers() to the call sites, so the protected section can be extended at the call site. Preparatory work for changing the NOHZ timer placement to a pull at expiry time model. No functional change. Signed-off-by: Richard Cochran (linutronix GmbH) <[email protected]> Signed-off-by: Anna-Maria Behnsen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-22timers: Add get next timer interrupt functionality for remote CPUsAnna-Maria Behnsen2-5/+100
To prepare for the conversion of the NOHZ timer placement to a pull at expiry time model it's required to have functionality available getting the next timer interrupt on a remote CPU. Locking of the timer bases and getting the information for the next timer interrupt functionality is split into separate functions. This is required to be compliant with lock ordering when the new model is in place. Signed-off-by: Anna-Maria Behnsen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-22timers: Split out "get next timer interrupt" functionalityAnna-Maria Behnsen1-26/+38
The functionality for getting the next timer interrupt in get_next_timer_interrupt() is split into a separate function fetch_next_timer_interrupt() to be usable by other call sites. This is preparatory work for the conversion of the NOHZ timer placement to a pull at expiry time model. No functional change. Signed-off-by: Anna-Maria Behnsen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-22timers: Retrieve next expiry of pinned/non-pinned timers separatelyAnna-Maria Behnsen1-4/+31
For the conversion of the NOHZ timer placement to a pull at expiry time model it's required to have separate expiry times for the pinned and the non-pinned (movable) timers. Therefore struct timer_events is introduced. No functional change Originally-by: Richard Cochran (linutronix GmbH) <[email protected]> Signed-off-by: Anna-Maria Behnsen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-22timers: Keep the pinned timers separate from the othersAnna-Maria Behnsen1-29/+56
Separate the storage space for pinned timers. Deferrable timers (doesn't matter if pinned or non pinned) are still enqueued into their own base. This is preparatory work for changing the NOHZ timer placement from a push at enqueue time to a pull at expiry time model. Originally-by: Richard Cochran (linutronix GmbH) <[email protected]> Signed-off-by: Anna-Maria Behnsen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-22timers: Split next timer interrupt logicAnna-Maria Behnsen1-13/+19
Split the logic for getting next timer interrupt (no matter of recalculated or already stored in base->next_expiry) into a separate function named next_timer_interrupt(). Make it available to local call sites only. No functional change. Signed-off-by: Anna-Maria Behnsen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-22timers: Simplify code in run_local_timers()Anna-Maria Behnsen1-8/+6
The logic for raising a softirq the way it is implemented right now, is readable for two timer bases. When increasing the number of timer bases, code gets harder to read. With the introduction of the timer migration hierarchy, there will be three timer bases. Therefore restructure the code to use a loop. No functional change. Signed-off-by: Anna-Maria Behnsen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-22timers: Make sure TIMER_PINNED flag is set in add_timer_on()Anna-Maria Behnsen1-1/+7
When adding a timer to the timer wheel using add_timer_on(), it is an implicitly pinned timer. With the timer pull at expiry time model in place, the TIMER_PINNED flag is required to make sure timers end up in proper base. Set the TIMER_PINNED flag unconditionally when add_timer_on() is executed. Signed-off-by: Anna-Maria Behnsen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-22workqueue: Use global variant for add_timer()Anna-Maria Behnsen1-1/+1
The implementation of the NOHZ pull at expiry model will change the timer bases per CPU. Timers, that have to expire on a specific CPU, require the TIMER_PINNED flag. If the CPU doesn't matter, the TIMER_PINNED flag must be dropped. This is required for call sites which use the timer alternately as pinned and not pinned timer like workqueues do. Therefore use add_timer_global() in __queue_delayed_work() for non-bound delayed work to make sure the TIMER_PINNED flag is dropped. Signed-off-by: Anna-Maria Behnsen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-22timers: Introduce add_timer() variants which modify timer flagsAnna-Maria Behnsen1-0/+34
A timer might be used as a pinned timer (using add_timer_on()) and later on as non-pinned timer using add_timer(). When the "NOHZ timer pull at expiry model" is in place, the TIMER_PINNED flag is required to be used whenever a timer needs to expire on a dedicated CPU. Otherwise the flag must not be set if expiration on a dedicated CPU is not required. add_timer_on()'s behavior will be changed during the preparation patches for the "NOHZ timer pull at expiry model" to unconditionally set the TIMER_PINNED flag. To be able to clear/ set the flag when queueing a timer, two variants of add_timer() are introduced. This is a preparatory step and has no functional change. Signed-off-by: Anna-Maria Behnsen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-22timers: Optimization for timer_base_try_to_set_idle()Anna-Maria Behnsen2-4/+9
When tick is stopped also the timer base is_idle flag is set. When reentering timer_base_try_to_set_idle() with the tick stopped, there is no need to check whether the timer base needs to be set idle again. When a timer was enqueued in the meantime, this is already handled by the tick_nohz_next_event() call which was executed before tick_nohz_stop_tick(). Signed-off-by: Anna-Maria Behnsen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-22timers: Move marking timer bases idle into tick_nohz_stop_tick()Anna-Maria Behnsen3-30/+71
The timer base is marked idle when get_next_timer_interrupt() is executed. But the decision whether the tick will be stopped and whether the system is able to go idle is done later. When the timer bases is marked idle and a new first timer is enqueued remote an IPI is raised. Even if it is not required because the tick is not stopped and the timer base is evaluated again at the next tick. To prevent this, the timer base is marked idle in tick_nohz_stop_tick() and get_next_timer_interrupt() is streamlined by only looking for the next timer interrupt. All other work is postponed to timer_base_try_to_set_idle() which is called by tick_nohz_stop_tick(). timer_base_try_to_set_idle() never resets timer_base::is_idle state. This is done when the tick is restarted via tick_nohz_restart_sched_tick(). With this, tick_sched::tick_stopped and timer_base::is_idle are always in sync. So there is no longer the need to execute timer_clear_idle() in tick_nohz_idle_retain_tick(). This was required before, as tick_nohz_next_event() set timer_base::is_idle even if the tick would not be stopped. So timer_clear_idle() is only executed, when timer base is idle. So the check whether timer base is idle, is now no longer required as well. While at it fix some nearby whitespace damage as well. Signed-off-by: Anna-Maria Behnsen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-22timers: Split out get next timer interruptAnna-Maria Behnsen1-9/+14
Split out get_next_timer_interrupt() to be able to extend it and make it reusable for other call sites. No functional change. Signed-off-by: Anna-Maria Behnsen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-22timers: Restructure get_next_timer_interrupt()Anna-Maria Behnsen1-6/+6
get_next_timer_interrupt() contains two parts for the next timer interrupt calculation. Those two parts are separated by forwarding the base clock. But the second part does not depend on the forwarded base clock. Therefore restructure get_next_timer_interrupt() to keep things together which belong together. No functional change. Signed-off-by: Anna-Maria Behnsen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-22cpu: Remove stray semicolonMax Kellermann1-1/+1
This syntax error was introduced by commit da92df490eea ("cpu: Mark cpu_possible_mask as __ro_after_init"). Fixes: da92df490eea ("cpu: Mark cpu_possible_mask as __ro_after_init") Signed-off-by: Max Kellermann <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-22Merge tag 'for-netdev' of ↵Paolo Abeni3-1/+8
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2024-02-22 The following pull-request contains BPF updates for your *net* tree. We've added 11 non-merge commits during the last 24 day(s) which contain a total of 15 files changed, 217 insertions(+), 17 deletions(-). The main changes are: 1) Fix a syzkaller-triggered oops when attempting to read the vsyscall page through bpf_probe_read_kernel and friends, from Hou Tao. 2) Fix a kernel panic due to uninitialized iter position pointer in bpf_iter_task, from Yafang Shao. 3) Fix a race between bpf_timer_cancel_and_free and bpf_timer_cancel, from Martin KaFai Lau. 4) Fix a xsk warning in skb_add_rx_frag() (under CONFIG_DEBUG_NET) due to incorrect truesize accounting, from Sebastian Andrzej Siewior. 5) Fix a NULL pointer dereference in sk_psock_verdict_data_ready, from Shigeru Yoshida. 6) Fix a resolve_btfids warning when bpf_cpumask symbol cannot be resolved, from Hari Bathini. bpf-for-netdev * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf, sockmap: Fix NULL pointer dereference in sk_psock_verdict_data_ready() selftests/bpf: Add negtive test cases for task iter bpf: Fix an issue due to uninitialized bpf_iter_task selftests/bpf: Test racing between bpf_timer_cancel_and_free and bpf_timer_cancel bpf: Fix racing between bpf_timer_cancel_and_free and bpf_timer_cancel selftest/bpf: Test the read of vsyscall page under x86-64 x86/mm: Disallow vsyscall page read for copy_from_kernel_nofault() x86/mm: Move is_vsyscall_vaddr() into asm/vsyscall.h bpf, scripts: Correct GPL license name xsk: Add truesize to skb_add_rx_frag(). bpf: Fix warning for bpf_cpumask in verifier ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
2024-02-21mm: convert mm_counter_file() to take a folioKefeng Wang1-1/+1
Now all callers of mm_counter_file() have a folio, convert mm_counter_file() to take a folio. Saves a call to compound_head() hidden inside PageSwapBacked(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-21ring-buffer: Do not let subbuf be bigger than write maskSteven Rostedt (Google)1-0/+4
The data on the subbuffer is measured by a write variable that also contains status flags. The counter is just 20 bits in length. If the subbuffer is bigger than then counter, it will fail. Make sure that the subbuffer can not be set to greater than the counter that keeps track of the data on the subbuffer. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Fixes: 2808e31ec12e5 ("ring-buffer: Add interface for configuring trace sub buffer size") Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-02-21clocksource: Scale the watchdog read retries automaticallyFeng Tang2-12/+11
On a 8-socket server the TSC is wrongly marked as 'unstable' and disabled during boot time on about one out of 120 boot attempts: clocksource: timekeeping watchdog on CPU227: wd-tsc-wd excessive read-back delay of 153560ns vs. limit of 125000ns, wd-wd read-back delay only 11440ns, attempt 3, marking tsc unstable tsc: Marking TSC unstable due to clocksource watchdog TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'. sched_clock: Marking unstable (119294969739, 159204297)<-(125446229205, -5992055152) clocksource: Checking clocksource tsc synchronization from CPU 319 to CPUs 0,99,136,180,210,542,601,896. clocksource: Switched to clocksource hpet The reason is that for platform with a large number of CPUs, there are sporadic big or huge read latencies while reading the watchog/clocksource during boot or when system is under stress work load, and the frequency and maximum value of the latency goes up with the number of online CPUs. The cCurrent code already has logic to detect and filter such high latency case by reading the watchdog twice and checking the two deltas. Due to the randomness of the latency, there is a low probabilty that the first delta (latency) is big, but the second delta is small and looks valid. The watchdog code retries the readouts by default twice, which is not necessarily sufficient for systems with a large number of CPUs. There is a command line parameter 'max_cswd_read_retries' which allows to increase the number of retries, but that's not user friendly as it needs to be tweaked per system. As the number of required retries is proportional to the number of online CPUs, this parameter can be calculated at runtime. Scale and enlarge the number of retries according to the number of online CPUs and remove the command line parameter completely. [ tglx: Massaged change log and comments ] Signed-off-by: Feng Tang <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Jin Wang <[email protected]> Tested-by: Paul E. McKenney <[email protected]> Reviewed-by: Waiman Long <[email protected]> Reviewed-by: Paul E. McKenney <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-21time/kunit: Use correct format specifierDavid Gow1-1/+1
'days' is a s64 (from div_s64), and so should use a %lld specifier. This was found by extending KUnit's assertion macros to use gcc's __printf attribute. Fixes: 276010551664 ("time: Improve performance of time64_to_tm()") Signed-off-by: David Gow <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-21pidfd: allow to override signal scope in pidfd_send_signal()Christian Brauner1-9/+37
Right now we determine the scope of the signal based on the type of pidfd. There are use-cases where it's useful to override the scope of the signal. For example in [1]. Add flags to determine the scope of the signal: (1) PIDFD_SIGNAL_THREAD: send signal to specific thread reference by @pidfd (2) PIDFD_SIGNAL_THREAD_GROUP: send signal to thread-group of @pidfd (2) PIDFD_SIGNAL_PROCESS_GROUP: send signal to process-group of @pidfd Since we now allow specifying PIDFD_SEND_PROCESS_GROUP for pidfd_send_signal() to send signals to process groups we need to adjust the check restricting si_code emulation by userspace to account for PIDTYPE_PGID. Reviewed-by: Oleg Nesterov <[email protected]> Link: https://github.com/systemd/systemd/issues/31093 [1] Link: https://lore.kernel.org/r/20240210-chihuahua-hinzog-3945b6abd44a@brauner Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
2024-02-20workqueue: Make @flags handling consistent across set_work_data() and friendsTejun Heo1-16/+16
- set_work_data() takes a separate @flags argument but just ORs it to @data. This is more confusing than helpful. Just take @data. - Use the name @flags consistently and add the parameter to set_work_pool_and_{keep|clear}_pending(). This will be used by the planned disable/enable support. No functional changes. Signed-off-by: Tejun Heo <[email protected]> Reviewed-by: Lai Jiangshan <[email protected]>
2024-02-20workqueue: Remove clear_work_data()Tejun Heo1-16/+8
clear_work_data() is only used in one place and immediately followed by smp_mb(), making it equivalent to set_work_pool_and_clear_pending() w/ WORK_OFFQ_POOL_NONE for @pool_id. Drop it. No functional changes. Signed-off-by: Tejun Heo <[email protected]> Reviewed-by: Lai Jiangshan <[email protected]>
2024-02-20workqueue: Factor out work_grab_pending() from __cancel_work_sync()Tejun Heo1-52/+80
The planned disable/enable support will need the same logic. Let's factor it out. No functional changes. v2: Update function comment to include @irq_flags. Signed-off-by: Tejun Heo <[email protected]> Reviewed-by: Lai Jiangshan <[email protected]>
2024-02-20workqueue: Clean up enum work_bits and related constantsTejun Heo1-4/+4
The bits of work->data are used for a few different purposes. How the bits are used is determined by enum work_bits. The planned disable/enable support will add another use, so let's clean it up a bit in preparation. - Let WORK_STRUCT_*_BIT's values be determined by enum definition order. - Deliminate different bit sections the same way using SHIFT and BITS values. - Rename __WORK_OFFQ_CANCELING to WORK_OFFQ_CANCELING_BIT for consistency. - Introduce WORK_STRUCT_PWQ_SHIFT and replace WORK_STRUCT_FLAG_MASK and WORK_STRUCT_WQ_DATA_MASK with WQ_STRUCT_PWQ_MASK for clarity. - Improve documentation. No functional changes. Signed-off-by: Tejun Heo <[email protected]> Reviewed-by: Lai Jiangshan <[email protected]>
2024-02-20workqueue: Introduce work_cancel_flagsTejun Heo1-12/+17
The cancel path used bool @is_dwork to distinguish canceling a regular work and a delayed one. The planned disable/enable support will need passing around another flag in the code path. As passing them around with bools will be confusing, let's introduce named flags to pass around in the cancel path. WORK_CANCEL_DELAYED replaces @is_dwork. No functional changes. Signed-off-by: Tejun Heo <[email protected]> Reviewed-by: Lai Jiangshan <[email protected]>