path: root/kernel
Age | Commit message | Author | Files | Lines
2023-09-24 | sched/fair: Fix warning in bandwidth distribution | Josh Don | 1 | -11/+25
We've observed the following warning being hit in distribute_cfs_runtime():

  SCHED_WARN_ON(cfs_rq->runtime_remaining > 0)

We have the following race:

 - CPU 0: running bandwidth distribution (distribute_cfs_runtime). Inspects the local cfs_rq and makes its runtime_remaining positive. However, we defer unthrottling the local cfs_rq until after considering all remote cfs_rq's.

 - CPU 1: starts running bandwidth distribution from the slack timer. When it finds the cfs_rq for CPU 0 on the throttled list, it observes that the cfs_rq is throttled, yet is not on the CSD list, and has a positive runtime_remaining, thus triggering the warning in distribute_cfs_runtime.

To fix this, rework the local unthrottling logic to put the local cfs_rq on a local list, so that any future bandwidth distributions will realize that the cfs_rq is about to be unthrottled. Signed-off-by: Josh Don <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
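A minimal sketch of the local-list idea (field and helper names here are approximations, not the actual patch):

  static void distribute_cfs_runtime(struct cfs_bandwidth *cfs_b)
  {
          LIST_HEAD(local_unthrottle);
          struct cfs_rq *cfs_rq, *tmp;

          /*
           * Move local cfs_rq's onto a local list while topping up their
           * runtime, so a concurrent distribution never observes a
           * throttled cfs_rq with positive runtime_remaining that is
           * neither on the CSD list nor about to be unthrottled.
           */
          list_for_each_entry_safe(cfs_rq, tmp, &cfs_b->throttled_cfs_rq,
                                   throttled_list) {
                  if (cfs_rq_is_local(cfs_rq)) {  /* hypothetical helper */
                          /* grant_runtime() is hypothetical as well */
                          cfs_rq->runtime_remaining += grant_runtime(cfs_b, cfs_rq);
                          list_move(&cfs_rq->throttled_list, &local_unthrottle);
                  }
          }

          /* Only after all remote cfs_rq's have been considered: */
          list_for_each_entry_safe(cfs_rq, tmp, &local_unthrottle,
                                   throttled_list) {
                  list_del_init(&cfs_rq->throttled_list);
                  unthrottle_cfs_rq(cfs_rq);
          }
  }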
2023-09-24 | sched/fair: Make cfs_rq->throttled_csd_list available on !SMP | Josh Don | 2 | -6/+0
This makes the following patch cleaner by avoiding extra CONFIG_SMP conditionals on the availability of cfs_rq->throttled_csd_list. Signed-off-by: Josh Don <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-09-23 | Merge tag 'mm-hotfixes-stable-2023-09-23-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm | Linus Torvalds | 2 | -1/+2
Pull misc fixes from Andrew Morton: "13 hotfixes, 10 of which pertain to post-6.5 issues. The other three are cc:stable"

* tag 'mm-hotfixes-stable-2023-09-23-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  proc: nommu: fix empty /proc/<pid>/maps
  filemap: add filemap_map_order0_folio() to handle order0 folio
  proc: nommu: /proc/<pid>/maps: release mmap read lock
  mm: memcontrol: fix GFP_NOFS recursion in memory.high enforcement
  pidfd: prevent a kernel-doc warning
  argv_split: fix kernel-doc warnings
  scatterlist: add missing function params to kernel-doc
  selftests/proc: fixup proc-empty-vm test after KSM changes
  revert "scripts/gdb/symbols: add specific ko module load command"
  selftests: link libasan statically for tests with -fsanitize=address
  task_work: add kerneldoc annotation for 'data' argument
  mm: page_alloc: fix CMA and HIGHATOMIC landing on the wrong buddy list
  sh: mm: re-add lost __ref to ioremap_prot() to fix modpost warning
2023-09-22 | ring-buffer: Fix bytes info in per_cpu buffer stats | Zheng Yejian | 1 | -13/+15
The 'bytes' info in file 'per_cpu/cpu<X>/stats' means the number of bytes in the cpu buffer that have not been consumed. However, currently after consuming data by reading file 'trace_pipe', the 'bytes' info is not changed as expected:

  # cat per_cpu/cpu0/stats
  entries: 0
  overrun: 0
  commit overrun: 0
  bytes: 568             <--- 'bytes' is problematical !!!
  oldest event ts:  8651.371479
  now ts:           8653.912224
  dropped events: 0
  read events: 8

The root cause is an incorrect stat on cpu_buffer->read_bytes. To fix it:

 1. When accounting 'read_bytes', count the consumed event in rb_advance_reader();
 2. When accounting 'entries_bytes', exclude the discarded padding event which is smaller than the minimum size, because it is invisible to the reader. Then use rb_page_commit() instead of BUF_PAGE_SIZE where the accounting is done per page (read/remove/overrun).

Also correct the comments of ring_buffer_bytes_cpu() in this patch. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: [email protected] Fixes: c64e148a3be3 ("trace: Add ring buffer stats to measure rate of events") Signed-off-by: Zheng Yejian <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
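In spirit, the accounting change looks like this (a sketch; the real code in kernel/trace/ring_buffer.c is more involved and the helper and field names are approximate):

  static void rb_advance_reader(struct ring_buffer_per_cpu *cpu_buffer)
  {
          struct ring_buffer_event *event = rb_reader_event(cpu_buffer);
          unsigned length = rb_event_length(event);

          /*
           * Count the consumed event as read, so that
           * bytes = entries_bytes - read_bytes shrinks as trace_pipe
           * drains the buffer.
           */
          cpu_buffer->read_bytes += length;

          /* ... existing reader-advance logic ... */
  }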
2023-09-22 | Merge tag 'sched-urgent-2023-09-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 2 | -1/+2
Pull scheduler fix from Ingo Molnar: "Fix a PF_IDLE initialization bug that generated warnings on tiny-RCU"

* tag 'sched-urgent-2023-09-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kernel/sched: Modify initial boot task idle setup
2023-09-22 | hardening: Provide Kconfig fragments for basic options | Kees Cook | 1 | -0/+98
Inspired by Salvatore Mesoraca's earlier[1] efforts to provide some in-tree guidance for kernel hardening Kconfig options, add a new fragment named "hardening.config" (along with some arch-specific fragments) that enable a basic set of kernel hardening options that have the least (or no) performance impact and remove a reasonable set of legacy APIs. Using this fragment is as simple as running "make hardening.config". More extreme fragments can be added[2] in the future to cover all the recognized hardening options, and more per-architecture files can be added too. For now, document the fragments directly via comments. Perhaps .rst documentation can be generated from them in the future (rather than the other way around). [1] https://lore.kernel.org/kernel-hardening/[email protected]/ [2] https://github.com/KSPP/linux/issues/14 Cc: Salvatore Mesoraca <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Signed-off-by: Kees Cook <[email protected]>
2023-09-22 | sched/debug: Avoid checking in_atomic_preempt_off() twice in schedule_debug() | Liming Wu | 1 | -2/+1
in_atomic_preempt_off() already gets called once in schedule_debug(), which is the only caller of __schedule_bug(). Skip the second call within __schedule_bug(); it would always evaluate to true at this point. [ mingo: Clarified the changelog. ] Signed-off-by: Liming Wu <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-09-22 | locking/ww_mutex/test: Make sure we bail out instead of livelock | John Stultz | 1 | -4/+5
I've seen what appear to be livelocks in the stress_inorder_work() function, and looking at the code it is clear we can have a case where we continually retry acquiring the locks and never check whether we have passed the specified timeout. This patch reworks that function so we always check the timeout before iterating through the loop again. I believe others may have hit this previously here: https://lore.kernel.org/lkml/[email protected]/ Reported-by: Li Zhijian <[email protected]> Signed-off-by: John Stultz <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-09-22 | locking/ww_mutex/test: Fix potential workqueue corruption | John Stultz | 1 | -8/+12
In some cases running with the test-ww_mutex code, I was seeing odd behavior where sometimes it seemed flush_workqueue was returning before all the work threads were finished. Often this would cause strange crashes as the mutexes would be freed while they were being used. Looking at the code, there is a lifetime problem as the controlling thread that spawns the work allocates the "struct stress" structures that are passed to the workqueue threads. Then when the workqueue threads are finished, they free the stress struct that was passed to them. Unfortunately the workqueue work_struct node is in the stress struct. Which means the work_struct is freed before the work thread returns and while flush_workqueue is waiting. It seems like a better idea to have the controlling thread both allocate and free the stress structures, so that we can be sure we don't corrupt the workqueue by freeing the structure prematurely. So this patch reworks the test to do so, and with this change I no longer see the early flush_workqueue returns. Signed-off-by: John Stultz <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
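The ownership pattern, in simplified form (not the literal test code):

  struct stress {
          struct work_struct work;
          /* ... per-work test state ... */
  };

  static void stress_work_fn(struct work_struct *work)
  {
          struct stress *stress = container_of(work, struct stress, work);

          /* run the lock-stress iterations, but do NOT kfree(stress) here */
  }

  /* The controlling thread owns both allocation and free: */
  struct stress *stress = kmalloc(sizeof(*stress), GFP_KERNEL);

  INIT_WORK(&stress->work, stress_work_fn);
  queue_work(wq, &stress->work);

  flush_workqueue(wq);    /* every embedded work_struct is still alive here */
  kfree(stress);          /* freed only after the flush completes */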
2023-09-22 | locking/ww_mutex/test: Use prng instead of rng to avoid hangs at bootup | John Stultz | 1 | -2/+17
Booting w/ qemu without kvm, and with 64 cpus, I noticed we'd sometimes see hung task watchdog splats in get_random_u32_below() when using the test-ww_mutex stress test. While entropy exhaustion is no longer an issue, the RNG may be slower early in boot. The test-ww_mutex code will spawn off 128 threads (2x cpus) and each thread will call get_random_u32_below() a number of times to generate a random order of the 16 locks. This intense use takes time and, without kvm, qemu can be slow enough that we trip the hung task watchdogs. For this test we don't need true randomness, just mixed-up orders for testing ww_mutex lock acquisitions, so change the logic to use the prng instead, which takes less time and avoids the watchdogs. Feedback would be appreciated! Signed-off-by: John Stultz <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
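A sketch of the substitution (the seeding choice is illustrative):

  #include <linux/prandom.h>

  static struct rnd_state rng_state;

  /* One real random seed at init; everything afterwards is a cheap PRNG. */
  static void stress_rng_init(void)
  {
          prandom_seed_state(&rng_state, get_random_u64());
  }

  /* Stand-in for get_random_u32_below(ceil) in the lock-order shuffle: */
  static u32 stress_rand_below(u32 ceil)
  {
          return prandom_u32_state(&rng_state) % ceil;
  }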
2023-09-21 | cred: add get_cred_many and put_cred_many | Mateusz Guzik | 1 | -11/+15
Some of the frequent consumers of get_cred and put_cred operate on 2 references on the same creds back-to-back. Switch them to doing the work in one go instead. Signed-off-by: Mateusz Guzik <[email protected]> [PM: removed changelog from commit description] Signed-off-by: Paul Moore <[email protected]>
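The shape of the helpers (the usage counter's exact type varies across kernel versions, so treat this as a sketch):

  static inline const struct cred *get_cred_many(const struct cred *cred, int nr)
  {
          struct cred *c = (struct cred *)cred;

          if (!c)
                  return NULL;
          atomic_add(nr, &c->usage);      /* one atomic op for nr references */
          return cred;
  }

  static inline void put_cred_many(const struct cred *cred, int nr)
  {
          struct cred *c = (struct cred *)cred;

          if (c && atomic_sub_and_test(nr, &c->usage))
                  __put_cred(c);
  }

A caller holding two references on the same creds can then do put_cred_many(cred, 2) instead of calling put_cred(cred) twice.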
2023-09-21 | bpf: Disable zero-extension for BPF_MEMSX | Ilya Leoshkevich | 1 | -1/+1
On the architectures that use bpf_jit_needs_zext(), e.g., s390x, the verifier incorrectly inserts a zero-extension after BPF_MEMSX, leading to miscompilations like the one below:

  24:      89 1a ff fe 00 00 00 00   "r1 = *(s16 *)(r10 - 2);"       # zext_dst set
    0x3ff7fdb910e:  lgh   %r2,-2(%r13,%r0)        # load halfword
    0x3ff7fdb9114:  llgfr %r2,%r2                 # wrong!
  25:      65 10 00 03 00 00 7f ff   if r1 s> 32767 goto +3 <l0_1>   # check_cond_jmp_op()

Disable such zero-extensions. The JITs need to insert sign-extension themselves, if necessary. Suggested-by: Puranjay Mohan <[email protected]> Signed-off-by: Ilya Leoshkevich <[email protected]> Reviewed-by: Puranjay Mohan <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-09-21 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Paolo Abeni | 11 | -36/+184
Cross-merge networking fixes after downstream PR. No conflicts. Signed-off-by: Paolo Abeni <[email protected]>
2023-09-21 | Merge tag 'net-6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Linus Torvalds | 6 | -20/+142
Pull networking fixes from Paolo Abeni: "Including fixes from netfilter and bpf.

Current release - regressions:
 - bpf: adjust size_index according to the value of KMALLOC_MIN_SIZE
 - netfilter: fix entries val in rule reset audit log
 - eth: stmmac: fix incorrect rxq|txq_stats reference

Previous releases - regressions:
 - ipv4: fix null-deref in ipv4_link_failure
 - netfilter:
    - fix several GC related issues
    - fix race between IPSET_CMD_CREATE and IPSET_CMD_SWAP
 - eth: team: fix null-ptr-deref when team device type is changed
 - eth: i40e: fix VF VLAN offloading when port VLAN is configured
 - eth: ionic: fix 16bit math issue when PAGE_SIZE >= 64KB

Previous releases - always broken:
 - core: fix ETH_P_1588 flow dissector
 - mptcp: fix several connection hang-up conditions
 - bpf:
    - avoid deadlock when using queue and stack maps from NMI
    - add override check to kprobe multi link attach
 - hsr: properly parse HSRv1 supervisor frames.
 - eth: igc: fix infinite initialization loop with early XDP redirect
 - eth: octeon_ep: fix tx dma unmap len values in SG
 - eth: hns3: fix GRE checksum offload issue"

* tag 'net-6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (87 commits)
  sfc: handle error pointers returned by rhashtable_lookup_get_insert_fast()
  igc: Expose tx-usecs coalesce setting to user
  octeontx2-pf: Do xdp_do_flush() after redirects.
  bnxt_en: Flush XDP for bnxt_poll_nitroa0()'s NAPI
  net: ena: Flush XDP packets on error.
  net/handshake: Fix memory leak in __sock_create() and sock_alloc_file()
  net: hinic: Fix warning-hinic_set_vlan_fliter() warn: variable dereferenced before check 'hwdev'
  netfilter: ipset: Fix race between IPSET_CMD_CREATE and IPSET_CMD_SWAP
  netfilter: nf_tables: fix memleak when more than 255 elements expired
  netfilter: nf_tables: disable toggling dormant table state more than once
  vxlan: Add missing entries to vxlan_get_size()
  net: rds: Fix possible NULL-pointer dereference
  team: fix null-ptr-deref when team device type is changed
  net: bridge: use DEV_STATS_INC()
  net: hns3: add 5ms delay before clear firmware reset irq source
  net: hns3: fix fail to delete tc flower rules during reset issue
  net: hns3: only enable unicast promisc when mac table full
  net: hns3: fix GRE checksum offload issue
  net: hns3: add cmdq check for vf periodic service task
  net: stmmac: fix incorrect rxq|txq_stats reference
  ...
2023-09-21 | PM: hibernate: Use __get_safe_page() rather than touching the list | Brian Geffon | 1 | -4/+6
We found at least one situation where the safe pages list was empty and get_buffer() would gladly try to use a NULL pointer. Signed-off-by: Brian Geffon <[email protected]> Fixes: 8357376d3df2 ("[PATCH] swsusp: Improve handling of highmem") Cc: All applicable <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2023-09-21 | exit: add internal include file with helpers | Jens Axboe | 2 | -25/+37
Move struct wait_opts and waitid_info into kernel/exit.h, and include function declarations for the recently added helpers. Make them non-static as well. This is in preparation for adding a waitid operation through io_uring. With the abstracted helpers, this is now possible. Signed-off-by: Jens Axboe <[email protected]>
2023-09-21 | exit: add kernel_waitid_prepare() helper | Jens Axboe | 1 | -13/+25
Move the setup logic out of kernel_waitid(), and into a separate helper. No functional changes intended in this patch. Signed-off-by: Jens Axboe <[email protected]>
2023-09-21 | exit: move core of do_wait() into helper | Jens Axboe | 1 | -20/+31
Rather than have a maze of gotos, put the actual logic in __do_wait() and have do_wait() loop deal with waitqueue setup/teardown and whether to call __do_wait() again. No functional changes intended in this patch. Acked-by: Christian Brauner <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2023-09-21 | exit: abstract out should_wake helper for child_wait_callback() | Jens Axboe | 1 | -6/+14
Abstract out the helper that decides if we should wake up following a wake_up() callback on our internal waitqueue. No functional changes intended in this patch. Acked-by: Christian Brauner <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2023-09-21 | futex: Add sys_futex_requeue() | [email protected] | 2 | -0/+39
Finish off the 'simple' futex2 syscall group by adding sys_futex_requeue(). Unlike sys_futex_{wait,wake}() its arguments are too numerous to fit into a regular syscall. As such, use struct futex_waitv to pass the 'source' and 'destination' futexes to the syscall. This syscall implements what was previously known as FUTEX_CMP_REQUEUE and uses {val, uaddr, flags} for source and {uaddr, flags} for destination. This design explicitly allows requeueing between different types of futex by having a different flags word per uaddr. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Link: https://lore.kernel.org/r/[email protected]
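A hypothetical userspace sketch (libc may not provide a wrapper, hence raw syscall(2); flag and syscall names per the futex2 uapi headers):

  #include <linux/futex.h>
  #include <sys/syscall.h>
  #include <unistd.h>
  #include <stdint.h>

  static uint32_t futex1, futex2;

  static long requeue_waiters(uint32_t expected, int nr_wake, int nr_requeue)
  {
          struct futex_waitv src_dst[2] = {
                  /* source: {val, uaddr, flags} */
                  { .val = expected, .uaddr = (uintptr_t)&futex1,
                    .flags = FUTEX2_SIZE_U32 },
                  /* destination: {uaddr, flags} */
                  { .val = 0, .uaddr = (uintptr_t)&futex2,
                    .flags = FUTEX2_SIZE_U32 },
          };

          return syscall(SYS_futex_requeue, src_dst, 0, nr_wake, nr_requeue);
  }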
2023-09-21 | futex: Add flags2 argument to futex_requeue() | [email protected] | 3 | -10/+13
In order to support mixed size requeue, add a second flags argument to the internal futex_requeue() function. No functional change intended. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-09-21 | futex: Propagate flags into get_futex_key() | [email protected] | 5 | -15/+18
Instead of only passing FLAGS_SHARED as a boolean, pass down flags as a whole. No functional change intended. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-09-21 | futex: Add sys_futex_wait() | [email protected] | 4 | -55/+130
To complement sys_futex_waitv()/wake(), add sys_futex_wait(). This syscall implements what was previously known as FUTEX_WAIT_BITSET except it uses 'unsigned long' for the value and bitmask arguments, takes timespec and clockid_t arguments for the absolute timeout and uses FUTEX2 flags. The 'unsigned long' allows FUTEX2_SIZE_U64 on 64bit platforms. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-09-21 | futex: FLAGS_STRICT | [email protected] | 3 | -11/+15
The current semantics for futex_wake() are a bit loose; specifically, asking for 0 futexes to be woken actually gets you 1. Adding an early !nr return to sys_futex_wake() would make it return 0 even for unaligned futex words, because the alignment check sits in the shared futex_wake() function. Adding the !nr check there instead would affect the legacy sys_futex() semantics. Hence frob a flag :-( Suggested-by: André Almeida <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
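One plausible shape of the check (a sketch, not the verbatim patch):

  /* In the shared futex_wake() path: */
  if ((flags & FLAGS_STRICT) && !nr_wake)
          return 0;       /* new syscalls: waking zero futexes wakes zero */

  /* Legacy sys_futex() never sets FLAGS_STRICT and keeps waking one. */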
2023-09-21 | futex: Add sys_futex_wake() | [email protected] | 2 | -0/+31
To complement sys_futex_waitv() add sys_futex_wake(). This syscall implements what was previously known as FUTEX_WAKE_BITSET except it uses 'unsigned long' for the bitmask and takes FUTEX2 flags. The 'unsigned long' allows FUTEX2_SIZE_U64 on 64bit platforms. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-09-21 | futex: Validate futex value against futex size | [email protected] | 2 | -0/+13
Ensure the futex value fits in the given futex size. Since this adds a constraint to an existing syscall, it might possibly change behaviour. Currently the value would be truncated to a u32 and any high bits would get silently lost. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
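The constraint boils down to rejecting high bits that the futex word cannot hold; roughly:

  static inline bool futex_validate_input(unsigned int flags, u64 val)
  {
          int bits = 8 * futex_size(flags);       /* 8, 16, 32 or 64 */

          if (bits < 64 && (val >> bits))
                  return false;                   /* value would be truncated */

          return true;
  }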
2023-09-21 | futex: Flag conversion | [email protected] | 3 | -20/+71
Futex has 3 sets of flags:

 - legacy futex op bits
 - futex2 flags
 - internal flags

Add a few helpers to convert from the API flags into the internal flags. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Reviewed-by: André Almeida <[email protected]> Link: https://lore.kernel.org/r/[email protected]
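A sketch of such a conversion helper (uapi flag names per the futex2 series; the internal FLAGS_* names are approximate):

  static inline unsigned int futex2_to_flags(unsigned int flags2)
  {
          unsigned int flags = flags2 & FUTEX2_SIZE_MASK; /* size bits map 1:1 */

          if (!(flags2 & FUTEX2_PRIVATE))
                  flags |= FLAGS_SHARED;

          if (flags2 & FUTEX2_NUMA)
                  flags |= FLAGS_NUMA;

          return flags;
  }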
2023-09-21 | futex: Extend the FUTEX2 flags | [email protected] | 1 | -2/+7
Add the definition for the missing but always intended extra sizes, and add a NUMA flag for the planned NUMA extension. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Reviewed-by: André Almeida <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-09-21 | futex: Clarify FUTEX2 flags | [email protected] | 1 | -4/+3
sys_futex_waitv() is part of the futex2 series (the first and so far only one) of syscalls and has a flags field per futex (as opposed to flags being encoded in the futex op). This new flags field has a new namespace, which unfortunately isn't super explicit. Notably, it currently takes FUTEX_32 and FUTEX_PRIVATE_FLAG. Introduce the FUTEX2 namespace to clarify this. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Reviewed-by: André Almeida <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-09-21 | printk: fix illegal pbufs access for !CONFIG_PRINTK | John Ogness | 1 | -26/+18
When CONFIG_PRINTK is not set, PRINTK_MESSAGE_MAX is 0. This leads to a zero-sized array @outbuf in @printk_shared_pbufs. In console_flush_all() a pointer to the first element of the array is assigned with:

  char *outbuf = &printk_shared_pbufs.outbuf[0];

For !CONFIG_PRINTK this leads to a compiler warning:

  warning: array subscript 0 is outside array bounds of 'char[0]' [-Warray-bounds]

This is not really dangerous because printk_get_next_message() always returns false for !CONFIG_PRINTK, which leads to @outbuf never being used. However, it makes no sense to even compile these functions for !CONFIG_PRINTK. Extend the existing '#ifdef CONFIG_PRINTK' block to contain the formatting and emitting functions since these have no purpose in !CONFIG_PRINTK. This also allows removing several more !CONFIG_PRINTK dummies as well as moving @suppress_panic_printk into a CONFIG_PRINTK block. Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/ Signed-off-by: John Ogness <[email protected]> Reviewed-by: Sergey Senozhatsky <[email protected]> Signed-off-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-09-21 | sched/debug: Update stale reference to sched_debug.c | Sebastian Andrzej Siewior | 1 | -1/+1
Since commit 8a99b6833c884 ("sched: Move SCHED_DEBUG sysctl to debugfs"), the sched_debug interface lives in debugfs rather than /proc, but the comment still mentions the outdated proc interface. Update the comment to point to the current location of the interface. Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-09-21 | sched/debug: Remove the /proc/sys/kernel/sched_child_runs_first sysctl | Sebastian Andrzej Siewior | 3 | -16/+0
The /proc/sys/kernel/sched_child_runs_first knob is no longer connected since commit 5e963f2bd4654 ("sched/fair: Commit to EEVDF"). Remove it. Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-09-20 | bpf: unconditionally reset backtrack_state masks on global func exit | Andrii Nakryiko | 1 | -5/+3
In mark_chain_precision() logic, when we reach the entry to a global func, it is expected that R1-R5 might be still requested to be marked precise. This would correspond to some integer input arguments being tracked as precise. This is all expected and handled as a special case. What's not expected is that we'll leave backtrack_state structure with some register bits set. This is because for subsequent precision propagations backtrack_state is reused without clearing masks, as all code paths are carefully written in a way to leave empty backtrack_state with zeroed out masks, for speed. The fix is trivial, we always clear register bit in the register mask, and then, optionally, set reg->precise if register is SCALAR_VALUE type. Reported-by: Chris Mason <[email protected]> Fixes: be2ef8161572 ("bpf: allow precision tracking for programs with subprogs") Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
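The essence of the fix (bt_* names as used by the verifier's backtracking helpers; a sketch):

  /* On entry to a global subprog, R1-R5 may still be marked: */
  for (i = BPF_REG_1; i <= BPF_REG_5; i++) {
          reg = &st->frame[0]->regs[i];
          bt_clear_reg(bt, i);            /* always clear the mask bit */
          if (reg->type == SCALAR_VALUE)
                  reg->precise = true;    /* only scalars become precise */
  }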
2023-09-20 | livepatch: Fix missing newline character in klp_resolve_symbols() | Zheng Yejian | 1 | -1/+1
Without the newline character, the log may not be printed immediately after the error occurs. Fixes: ca376a937486 ("livepatch: Prevent module-specific KLP rela sections from referencing vmlinux symbols") Signed-off-by: Zheng Yejian <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Signed-off-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected]
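The fix is the classic one-character kind (the message string here is illustrative):

  -       pr_err("symbol '%s' not found in symbol table", name);
  +       pr_err("symbol '%s' not found in symbol table\n", name);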
2023-09-20 | futex/pi: Fix recursive rt_mutex waiter state | Peter Zijlstra | 2 | -30/+52
Some new assertions pointed out that the existing code has nested rt_mutex wait state in the futex code. Specifically, the futex_lock_pi() cancel case uses spin_lock() while there still is a rt_waiter enqueued for this task, resulting in a state where there are two waiters for the same task (and task_struct::pi_blocked_on gets scrambled). The reason to take hb->lock at this point is to avoid the wake_futex_pi() EAGAIN case. This happens when futex_top_waiter() and rt_mutex_top_waiter() state becomes inconsistent. The current rules are such that this inconsistency will not be observed. Notably the case that needs to be avoided is where futex_lock_pi() and futex_unlock_pi() interleave such that unlock will fail to observe a new waiter. *However* the case at hand is where a waiter is leaving, in this case the race means a waiter that is going away is not observed -- which is harmless, provided this race is explicitly handled. This is a somewhat dangerous proposition because the converse race is not observing a new waiter, which must absolutely not happen. But since the race is valid this cannot be asserted. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Reviewed-by: Sebastian Andrzej Siewior <[email protected]> Tested-by: Sebastian Andrzej Siewior <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2023-09-20 | locking/rtmutex: Add a lockdep assert to catch potential nested blocking | Thomas Gleixner | 3 | -0/+6
There used to be a BUG_ON(current->pi_blocked_on) in the lock acquisition functions, but that vanished in one of the rtmutex overhauls. Bring it back in form of a lockdep assert to catch code paths which take rtmutex based locks with current::pi_blocked_on != NULL. Reported-by: Crystal Wood <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: "Peter Zijlstra (Intel)" <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2023-09-20 | locking/rtmutex: Use rt_mutex specific scheduler helpers | Sebastian Andrzej Siewior | 5 | -3/+40
Have rt_mutex use the rt_mutex specific scheduler helpers to avoid recursion vs rtlock on the PI state. [[ peterz: adapted to new names ]] Reported-by: Crystal Wood <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2023-09-20 | sched: Provide rt_mutex specific scheduler helpers | Peter Zijlstra | 1 | -4/+32
With PREEMPT_RT there is a rt_mutex recursion problem where sched_submit_work() can use an rtlock (aka spinlock_t). More specifically what happens is:

  mutex_lock() /* really rt_mutex */
  ...
    __rt_mutex_slowlock_locked()
      task_blocks_on_rt_mutex()
        // enqueue current task as waiter
        // do PI chain walk
      rt_mutex_slowlock_block()
        schedule()
          sched_submit_work()
          ...
          spin_lock() /* really rtlock */
          ...
            __rt_mutex_slowlock_locked()
              task_blocks_on_rt_mutex()
                // enqueue current task as waiter *AGAIN*
                // *CONFUSION*

Fix this by making rt_mutex do the sched_submit_work() early, before it enqueues itself as a waiter -- before it even knows *if* it will wait. [[ basically Thomas' patch but with different naming and a few asserts added ]] Originally-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2023-09-20 | sched: Extract __schedule_loop() | Thomas Gleixner | 1 | -10/+11
There are currently two implementations of this basic __schedule() loop, and there is soon to be a third. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2023-09-20 | locking/rtmutex: Avoid unconditional slowpath for DEBUG_RT_MUTEXES | Sebastian Andrzej Siewior | 2 | -2/+21
With DEBUG_RT_MUTEXES enabled the fast-path rt_mutex_cmpxchg_acquire() always fails and all lock operations take the slow path. Provide a new helper inline rt_mutex_try_acquire() which maps to rt_mutex_cmpxchg_acquire() in the non-debug case. For the debug case it invokes rt_mutex_slowtrylock() which can acquire a non-contended rtmutex under full debug coverage. Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
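Approximately, the new helper:

  static __always_inline bool rt_mutex_try_acquire(struct rt_mutex_base *lock)
  {
  #ifdef CONFIG_DEBUG_RT_MUTEXES
          /*
           * The cmpxchg fast path always fails with debug enabled;
           * the trylock slow path still acquires a non-contended
           * rtmutex under full debug coverage.
           */
          return rt_mutex_slowtrylock(lock);
  #else
          return rt_mutex_cmpxchg_acquire(lock, NULL, current);
  #endif
  }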
2023-09-20 | sched: Constrain locks in sched_submit_work() | Peter Zijlstra | 1 | -0/+9
Even though sched_submit_work() is run from preemptible context, it is discouraged to have it use blocking locks due to the recursion potential. Enforce this. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2023-09-19 | pidfd: prevent a kernel-doc warning | Randy Dunlap | 1 | -1/+1
Change the comment to match the function name that the SYSCALL_DEFINE() macros generate to prevent a kernel-doc warning:

  kernel/pid.c:628: warning: expecting prototype for pidfd_open(). Prototype was for sys_pidfd_open() instead

Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Randy Dunlap <[email protected]> Cc: Christian Brauner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-09-19 | task_work: add kerneldoc annotation for 'data' argument | Jens Axboe | 1 | -0/+1
A previous commit changed the arguments to task_work_cancel_match(), but didn't document all of them. Link: https://lkml.kernel.org/r/[email protected] Fixes: c7aab1a7c52b ("task_work: add helper for more targeted task_work canceling") Signed-off-by: Jens Axboe <[email protected]> Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/ Acked-by: Oleg Nesterov <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-09-19 | signal: Don't disable preemption in ptrace_stop() on PREEMPT_RT | Sebastian Andrzej Siewior | 1 | -2/+13
On PREEMPT_RT keeping preemption disabled during the invocation of cgroup_enter_frozen() is a problem because the function acquires css_set_lock which is a sleeping lock on PREEMPT_RT and must not be acquired with disabled preemption. The preempt-disabled section is only for performance optimisation reasons and can be avoided. Extend the comment and don't disable preemption before scheduling on PREEMPT_RT. Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-09-19 | signal: Add a proper comment about preempt_disable() in ptrace_stop() | Sebastian Andrzej Siewior | 1 | -3/+15
Commit 53da1d9456fe7 ("fix ptrace slowness") added a preempt-disable section between read_unlock() and the following schedule() invocation without explaining why it is needed. Replace the existing contentless comment with a proper explanation to clarify that it is not needed for correctness but for performance reasons. Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-09-19 | bpf: Remove unused variables. | Alexei Starovoitov | 1 | -5/+1
Remove unused prev_offset, min_size, krec_size variables. Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/ Fixes: aaa619ebccb2 ("bpf: Refactor check_btf_func and split into two phases") Signed-off-by: Alexei Starovoitov <[email protected]>
2023-09-19 | bpf: Fix bpf_throw warning on 32-bit arch | Kumar Kartikeya Dwivedi | 1 | -1/+1
On 32-bit architectures the pointer width is 32 bits, so when we try to cast a u64 down to a pointer the compiler complains about the mismatch in integer size. Fix this by first casting to long, which matches the pointer width on the targets supported by Linux. Fixes: ec5290a178b7 ("bpf: Prevent KASAN false positive with bpf_throw") Reported-by: Matthieu Baerts <[email protected]> Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]> Tested-by: Matthieu Baerts <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
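The general pattern of the fix:

  static void *cookie_to_ptr(u64 cookie)
  {
          /*
           * A direct (void *)cookie cast warns on 32-bit targets where
           * pointers are 32 bits wide; casting through long first makes
           * the narrowing explicit and matches the pointer width on all
           * architectures Linux supports.
           */
          return (void *)(long)cookie;
  }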
2023-09-19 | kernel/sched: Modify initial boot task idle setup | Liam R. Howlett | 2 | -1/+2
Initial booting is setting the task flag to idle (PF_IDLE) by the call path sched_init() -> init_idle(). Having the task idle and calling call_rcu() in kernel/rcu/tiny.c means that TIF_NEED_RESCHED will be set. Subsequent calls to any cond_resched() will enable IRQs, potentially earlier than the IRQ setup has completed. Recent changes have caused just this scenario and IRQs have been enabled early. This causes a warning later in start_kernel() as interrupts are enabled before they are fully set up. Fix this issue by setting the PF_IDLE flag later in the boot sequence. Although the boot task was marked as idle since (at least) d80e4fda576d, I am not sure that it is wrong to do so. The forced context-switch on idle task was introduced in the tiny_rcu update, so I'm going to claim this fixes 5f6130fa52ee. Fixes: 5f6130fa52ee ("tiny_rcu: Directly force QS when call_rcu_[bh|sched]() on idle_task") Signed-off-by: Liam R. Howlett <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/linux-mm/CAMuHMdWpvpWoDa=Ox-do92czYRvkok6_x6pYUH+ZouMcJbXy+Q@mail.gmail.com/
2023-09-19 | sched/fair: Rename check_preempt_curr() to wakeup_preempt() | Ingo Molnar | 7 | -26/+26
The name is a bit opaque - make it clear that this is about wakeup preemption. Also rename the ->check_preempt_curr() methods similarly. Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]>
2023-09-19 | sched/fair: Rename check_preempt_wakeup() to check_preempt_wakeup_fair() | Ingo Molnar | 1 | -2/+2
Other scheduling classes already postfix their similar methods with the class name. Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]>