path: root/kernel
Age | Commit message | Author | Files, Lines
2023-05-28 | tracing: Only make selftest conditionals affect the global_trace | Steven Rostedt (Google) | 1 file, -2/+8
The tracing_selftest_running and tracing_selftest_disabled variables were added to keep trace_printk() and other writes from affecting the tracing selftests, as the selftests examine the ring buffer to see whether it contains what they expect. trace_printk() and friends could add to the ring buffer and cause the selftests to fail (and then disable the tracer being tested). To keep that from happening, these variables keep trace_printk() and friends from writing to the ring buffer while the tests are running. But this only matters for the top-level ring buffer (owned by the global_trace instance). There is no reason to prevent writing into the ring buffers of other instances via trace_array_printk() and friends. For the functions that can be used by other instances, check whether global_trace is the instance being written to before deciding to disallow the write, as sketched below.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Steven Rostedt (Google) <[email protected]>
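A minimal sketch of the check described above, using the variable names from the commit message (the exact placement inside trace_array_printk() and friends is an assumption):

	/* only writes aimed at the top-level buffer are suppressed */
	if (tr == &global_trace &&
	    (tracing_selftest_running || tracing_selftest_disabled))
		return 0;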
2023-05-28 | tracing: Make tracing_selftest_running/delete nops when not used | Steven Rostedt (Google) | 1 file, -1/+4
There's no reason to test the condition variables tracing_selftest_running or tracing_selftest_delete when the tracing selftests are not enabled. Make them be defined as 0 when the selftests are not configured in, as sketched below.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Steven Rostedt (Google) <[email protected]>
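A hedged sketch of the shape of the change: with the selftests compiled out, the flags become constant zeroes and the compiler drops every test against them.

	#ifdef CONFIG_FTRACE_STARTUP_TEST
	extern bool tracing_selftest_running;
	extern bool tracing_selftest_disabled;
	#else
	#define tracing_selftest_running	0
	#define tracing_selftest_disabled	0
	#endif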
2023-05-28 | tracing: Have tracer selftests call cond_resched() before running | Steven Rostedt (Google) | 1 file, -0/+7
As there are more and more internal selftests being added to the Linux kernel (KSAN, lockdep, etc) the selftests are taking longer to run when these are enabled. Add a cond_resched() to the calling of do_run_tracer_selftest() to force a schedule if NEED_RESCHED is set, otherwise the soft lockup watchdog may trigger on boot up. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Steven Rostedt (Google) <[email protected]>
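A minimal sketch of the call site described above (the surrounding loop in register_tracer() is elided):

	/* give the scheduler a chance before a potentially long selftest,
	 * otherwise the soft-lockup watchdog may trigger during boot */
	cond_resched();
	ret = do_run_tracer_selftest(type);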
2023-05-28 | tracing: Move setting of tracing_selftest_running out of register_tracer() | Steven Rostedt (Google) | 1 file, -4/+16
The variables tracing_selftest_running and tracing_selftest_disabled are only used when CONFIG_FTRACE_STARTUP_TEST is enabled. Make them only visible within the selftest code. The setting of those variables happens in the register_tracer() call, in a location where it does not need to be. Create a wrapper around run_tracer_selftest() called do_run_tracer_selftest() which sets those variables, and have register_tracer() call that instead (see the sketch below). Having those variables set only within the CONFIG_FTRACE_STARTUP_TEST scope gets rid of them (and of the tests against them) when the startup tests are not enabled (most cases).
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Steven Rostedt (Google) <[email protected]>
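A hedged sketch of the wrapper: the selftest flags now live, and are set, entirely inside the CONFIG_FTRACE_STARTUP_TEST scope.

	#ifdef CONFIG_FTRACE_STARTUP_TEST
	static int do_run_tracer_selftest(struct tracer *type)
	{
		int ret;

		tracing_selftest_running = true;
		ret = run_tracer_selftest(type);
		tracing_selftest_running = false;

		return ret;
	}
	#else
	static inline int do_run_tracer_selftest(struct tracer *type)
	{
		return 0;
	}
	#endif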
2023-05-28 | Merge tag 'core-debugobjects-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 1 file, -7/+21
Pull debugobjects fixes from Thomas Gleixner:
 "Two fixes for debugobjects:

  - Prevent the allocation path from waking up kswapd. That's a long
    standing issue due to the GFP_ATOMIC allocation flag. As debug
    objects can be invoked from pretty much any context, waking kswapd
    can end up in arbitrary lock chains versus the waitqueue lock

  - Correct the explicit lockdep wait-type violation in
    debug_object_fill_pool()"

* tag 'core-debugobjects-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  debugobjects: Don't wake up kswapd from fill_pool()
  debugobjects,locking: Annotate debug_object_fill_pool() wait type violation
2023-05-28 | Revert "kheaders: substituting --sort in archive creation" | Masahiro Yamada | 1 file, -6/+3
This reverts commit 700dea5a0bea9f64eba89fae7cb2540326fdfdc1. The reason for that commit was that --sort=ORDER was only introduced in tar 1.28 (2014). More than 3 years have passed since then. Requiring GNU tar 1.28 should be fine now because we require GCC 5.1 (2015).
Signed-off-by: Masahiro Yamada <[email protected]>
Reviewed-by: Nicolas Schier <[email protected]>
2023-05-27 | Merge tag 'for-linus-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip | Linus Torvalds | 1 file, -2/+2
Pull xen fixes from Juergen Gross:

 - a double free fix in the Xen pvcalls backend driver

 - a fix for a regression causing the MSI related sysfs entries not to
   be created in Xen PV guests

 - a fix in the Xen blkfront driver for handling insane input data
   better

* tag 'for-linus-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  x86/pci/xen: populate MSI sysfs entries
  xen/pvcalls-back: fix double frees with pvcalls_new_active_socket()
  xen/blkfront: Only check REQ_FUA for writes
2023-05-26 | Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next | Jakub Kicinski | 5 files, -42/+105
Daniel Borkmann says:

====================
pull-request: bpf-next 2023-05-26

We've added 54 non-merge commits during the last 10 day(s) which contain a total of 76 files changed, 2729 insertions(+), 1003 deletions(-).

The main changes are:

1) Add the capability to destroy sockets in BPF through a new kfunc, from Aditi Ghag.

2) Support O_PATH fds in BPF_OBJ_PIN and BPF_OBJ_GET commands, from Andrii Nakryiko.

3) Add capability for libbpf to resize datasec maps when backed via mmap, from JP Kobryn.

4) Move all the test kfuncs for CI out of the kernel and into bpf_testmod, from Jiri Olsa.

5) Big batch of xsk selftest improvements to prep for multi-buffer testing, from Magnus Karlsson.

6) Show the target_{obj,btf}_id in tracing link's fdinfo and dump it via bpftool, from Yafang Shao.

7) Various misc BPF selftest improvements to work with upcoming LLVM 17, from Yonghong Song.

8) Extend bpftool to specify netdevice for resolving XDP hints, from Larysa Zaremba.

9) Document masking in shift operations for the insn set document, from Dave Thaler.

10) Extend BPF selftests to check xdp_feature support for bond driver, from Lorenzo Bianconi.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (54 commits)
  bpf: Fix bad unlock balance on freeze_mutex
  libbpf: Ensure FD >= 3 during bpf_map__reuse_fd()
  libbpf: Ensure libbpf always opens files with O_CLOEXEC
  selftests/bpf: Check whether to run selftest
  libbpf: Change var type in datasec resize func
  bpf: drop unnecessary bpf_capable() check in BPF_MAP_FREEZE command
  libbpf: Selftests for resizing datasec maps
  libbpf: Add capability for resizing datasec maps
  selftests/bpf: Add path_fd-based BPF_OBJ_PIN and BPF_OBJ_GET tests
  libbpf: Add opts-based bpf_obj_pin() API and add support for path_fd
  bpf: Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
  libbpf: Start v1.3 development cycle
  bpf: Validate BPF object in BPF_OBJ_PIN before calling LSM
  bpftool: Specify XDP Hints ifname when loading program
  selftests/bpf: Add xdp_feature selftest for bond device
  selftests/bpf: Test bpf_sock_destroy
  selftests/bpf: Add helper to get port using getsockname
  bpf: Add bpf_sock_destroy kfunc
  bpf: Add kfunc filter function to 'struct btf_kfunc_id_set'
  bpf: udp: Implement batching for sockets iterator
  ...
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
2023-05-26 | kallsyms: remove unused API lookup_symbol_attrs | Maninder Singh | 2 files, -56/+0
With commit 7878c231dae0 ("slab: remove /proc/slab_allocators"), the last usage of lookup_symbol_attrs() was removed, so drop the now-redundant API.
Signed-off-by: Maninder Singh <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Luis Chamberlain <[email protected]>
2023-05-26 | tracing: Replace all non-returning strlcpy with strscpy | Azeem Shaikh | 5 files, -10/+10
strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy(). No return values were used, so direct replacement with strscpy() is safe.
[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89
Signed-off-by: Azeem Shaikh <[email protected]>
Acked-by: Masami Hiramatsu (Google) <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
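A before/after illustration of the replacement (destination and source names are illustrative):

	char name[TASK_COMM_LEN];

	/* before: strlcpy() walks the whole source, even past sizeof(name)
	 * bytes, so an unterminated source means a linear read overflow */
	strlcpy(name, src, sizeof(name));

	/* after: strscpy() never reads more than sizeof(name) bytes */
	strscpy(name, src, sizeof(name));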
2023-05-26 | kallsyms: remove unused arch_get_kallsym() helper | Arnd Bergmann | 1 file, -27/+1
The arch_get_kallsym() function was introduced so that x86 could override it, but that override was removed in bf904d2762ee ("x86/pti/64: Remove the SYSCALL64 entry trampoline"), so now it does nothing except cause a warning about a missing prototype:

  kernel/kallsyms.c:662:12: error: no previous prototype for 'arch_get_kallsym' [-Werror=missing-prototypes]
    662 | int __weak arch_get_kallsym(unsigned int symnum, unsigned long *value,

Restore the old behavior from before d83212d5dd67 ("kallsyms, x86: Export addresses of PTI entry trampolines") to simplify the code and avoid the warning.
Signed-off-by: Arnd Bergmann <[email protected]>
Tested-by: Alan Maguire <[email protected]>
[mcgrof: fold in bpf selftest fix]
Signed-off-by: Luis Chamberlain <[email protected]>
2023-05-26 | mm/slab: rename CONFIG_SLAB to CONFIG_SLAB_DEPRECATED | Vlastimil Babka | 1 file, -1/+0
As discussed at LSF/MM [1] [2] and with no objections raised there, deprecate the SLAB allocator. Rename the user-visible option so that users with CONFIG_SLAB=y get a new prompt with explanation during make oldconfig, while make olddefconfig will just switch to SLUB. In all defconfigs with CONFIG_SLAB=y remove the line so those also switch to SLUB. Regressions due to the switch should be reported to linux-mm and slab maintainers. [1] https://lore.kernel.org/all/[email protected]/ [2] https://lwn.net/Articles/932201/ Signed-off-by: Vlastimil Babka <[email protected]> Acked-by: Hyeonggon Yoo <[email protected]> Acked-by: David Rientjes <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> # m68k Acked-by: Helge Deller <[email protected]> # parisc
2023-05-26 | bpf: Fix bad unlock balance on freeze_mutex | Daniel Borkmann | 1 file, -2/+2
Commit c4c84f6fb2c4 ("bpf: drop unnecessary bpf_capable() check in BPF_MAP_FREEZE command") moved the permissions check outside of the freeze_mutex in the map_freeze() handler. The error paths, however, still jump to the err_put label, which tries to unlock freeze_mutex even though it was not locked in the first place. Fix it, as sketched below.
Fixes: c4c84f6fb2c4 ("bpf: drop unnecessary bpf_capable() check in BPF_MAP_FREEZE command")
Reported-by: [email protected]
Signed-off-by: Daniel Borkmann <[email protected]>
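A hedged sketch of the corrected flow in map_freeze() (map_get_sys_perms() is the existing permissions helper; the freeze work itself is elided): the check now fails before freeze_mutex is taken, so the error path must return directly instead of going through the unlock.

	if (!(map_get_sys_perms(map, f) & FMODE_CAN_WRITE)) {
		fdput(f);
		return -EPERM;		/* mutex not held yet: plain return */
	}

	mutex_lock(&map->freeze_mutex);
	err = do_freeze(map);		/* placeholder for the real work */
	mutex_unlock(&map->freeze_mutex);
	fdput(f);
	return err;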
2023-05-25 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Jakub Kicinski | 3 files, -18/+63
Cross-merge networking fixes after downstream PR. Signed-off-by: Jakub Kicinski <[email protected]>
2023-05-25 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Jakub Kicinski | 3 files, -4/+6
Cross-merge networking fixes after downstream PR.

Conflicts:

net/ipv4/raw.c
  3632679d9e4f ("ipv{4,6}/raw: fix output xfrm lookup wrt protocol")
  c85be08fc4fa ("raw: Stop using RTO_ONLINK.")
https://lore.kernel.org/all/[email protected]/

Adjacent changes:

drivers/net/ethernet/freescale/fec_main.c
  9025944fddfe ("net: fec: add dma_wmb to ensure correct descriptor values")
  144470c88c5d ("net: fec: using the standard return codes when xdp xmit errors")

Signed-off-by: Jakub Kicinski <[email protected]>
2023-05-25 | module: error out early on concurrent load of the same module file | Linus Torvalds | 1 file, -15/+43
It turns out that udev under certain circumstances will concurrently try to load the same modules over-and-over excessively. This isn't a kernel bug, but it ends up affecting the kernel, to the point that under certain circumstances we can fail to boot, because the kernel uses a lot of memory to read all the module data all at once.

Note that it isn't a memory leak, it's just basically a thundering herd problem happening at bootup with a lot of CPUs, with the worst cases then being pretty bad.

Admittedly the worst situations are somewhat contrived: lots and lots of CPUs, not a lot of memory, and KASAN enabled to make it all slower and as such (unintentionally) exacerbate the problem.

Luis explains: [1]

 "My best assessment of the situation is that each CPU in udev ends up
  triggering a load of duplicate set of modules, not just one, but *a
  lot*. Not sure what heuristics udev uses to load a set of modules per
  CPU."

Petr Pavlu chimes in: [2]

 "My understanding is that udev workers are forked. An initial kmod
  context is created by the main udevd process but no sharing happens
  after the fork. It means that the mentioned memory pool logic doesn't
  really kick in.

  Multiple parallel load requests come from multiple udev workers, for
  instance, each handling an udev event for one CPU device and making
  the exactly same requests as all others are doing at the same time.
  The optimization idea would be to recognize these duplicate requests
  at the udevd/kmod level and converge them"

Note that module loading has tried to mitigate this issue before, see for example commit 064f4536d139 ("module: avoid allocation if module is already present and ready"), which has a few ASCII graphs on memory use due to this same issue. However, while that noticed that the module was already loaded, and exited with an error early before spending any more time on setting up the module, it didn't handle the case of multiple concurrent module loads all being active - but not complete - at the same time.

Yes, one of them will eventually win the race and finalize its copy, and the others will then notice that the module already exists and error out, but while this all happens, we have tons of unnecessary concurrent work being done. Again, the real fix is for udev to not do that (maybe it should use threads instead of fork, and have actual shared data structures and not cause duplicate work). That real fix is apparently not trivial.

But it turns out that the kernel already has a pretty good model for dealing with concurrent access to the same file: the i_writecount of the inode. In fact, module loading already indirectly uses 'i_writecount', because 'kernel_read_file()' will in fact do

	ret = deny_write_access(file);
	if (ret)
		return ret;
	...
	allow_write_access(file);

around the read of the file data. We do not allow concurrent writes to the file, and return -ETXTBSY if the file was open for writing at the same time as the module data is loaded from it.

And the solution to the reader concurrency problem is to simply extend this "no concurrent writers" logic to simply be "exclusive access". Note that "exclusive" in this context isn't really some absolute thing: it's only exclusion from writers and from other "special readers" that do this writer denial. So we simply introduce a variation of that "deny_write_access()" logic that not only denies write access, but also requires that this is the _only_ such access that denies write access.

Which means that you can't start loading a module that is already being loaded as a module by somebody else, or you will get the same -ETXTBSY error that you would get if there were writers around.

[ It also means that you can't try to load a currently executing executable as a module, for the same reason: executables do that same "deny_write_access()" thing, and that's obviously where the whole ETXTBSY logic traditionally came from. This is not a problem for kernel modules, since the set of normal executable files and kernel module files is entirely disjoint. ]

This new function is called "exclusive_deny_write_access()", and the implementation is trivial, in that it's just an atomic decrement of i_writecount if it was 0 before.

To use that new exclusivity check, all we then do is wrap the module loading with that exclusive_deny_write_access() / allow_write_access() pair. The actual patch is a bit bigger than that, because we want to surround not just the "load file data" part, but the whole module setup, to get maximum exclusion. So this ends up splitting up "finit_module()" into a few helper functions to make it all very clear and legible.

In Luis' test-case (bringing up 255 vcpu's in a virtual machine [3]), the "wasted vmalloc" space (ie module data read into a vmalloc'ed area in order to be loaded as a module, but then discarded because somebody else loaded the same module instead) dropped from 1.8GiB to 474kB. Yes, that's gigabytes to kilobytes.

It doesn't drop completely to zero, because even with this change, you can still end up having completely serial pointless module loads, where one udev process has loaded a module fully (and thus the kernel has released that exclusive lock on the module file), and then another udev process tries to load the same module again. So while we cannot fully get rid of the fundamental bug in user space, we _can_ get rid of the excessive concurrent thundering herd effect.

A couple of final side notes on this all:

 - This tweak only affects the "finit_module()" system call, which gives the kernel a file descriptor with the module data. You can also just feed the module data as raw data from user space with "init_module()" (note the lack of 'f' at the beginning), and obviously for that case we do _not_ have any "exclusive read" logic.

   So if you absolutely want to do things wrong in user space, and try to load the same module multiple times, and error out only later when the kernel ends up saying "you can't load the same module name twice", you can still do that. And in fact, some distros will do exactly that, because they will uncompress the kernel module data in user space before feeding it to the kernel (mainly because they haven't started using the new kernel side decompression yet).

   So this is not some absolute "you can't do concurrent loads of the same module". It's literally just a very simple heuristic that will catch it early in case you try to load the exact same module file at the same time, and in that case avoid a potentially nasty situation.

 - There is another user of "deny_write_access()": the verity code that enables fs-verity on a file (the FS_IOC_ENABLE_VERITY ioctl). If you use fs-verity and you care about verifying the kernel modules (which does make sense), you should do it *before* loading said kernel module. That may sound obvious, but now the implementation basically requires it. Because if you try to do it concurrently, the kernel may refuse to load the module file that is being set up by the fs-verity code.

 - This all will obviously mean that if you insist on loading the same module in parallel, only one module load will succeed, and the others will return with an error. That was true before too, but what is different is that the -ETXTBSY error can be returned *before* the success case of another process fully loading and instantiating the module.

   Again, that might sound obvious, and it is indeed the whole point of the whole change: we are much quicker to notice the whole "you're already in the process of loading this module". So it's very much intentional, but it does mean that if you just spray the kernel with "finit_module()", and expect that the module is immediately loaded afterwards without checking the return value, you are doing something horribly horribly wrong.

   I'd like to say that that would never happen, but the whole _reason_ for this commit is that udev is currently doing something horribly horribly wrong, so ...

Link: https://lore.kernel.org/all/[email protected]/ [1]
Link: https://lore.kernel.org/all/[email protected]/ [2]
Link: https://lore.kernel.org/lkml/ZG%2Fa+nrt4%[email protected]/ [3]
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Lucas De Marchi <[email protected]>
Cc: Petr Pavlu <[email protected]>
Tested-by: Luis Chamberlain <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
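A minimal sketch, following the description above, of what exclusive_deny_write_access() can look like; the 0 -> -1 cmpxchg is an assumption consistent with "an atomic decrement of i_writecount if it was 0 before":

	static int exclusive_deny_write_access(struct file *file)
	{
		struct inode *inode = file_inode(file);

		/* succeed only on the 0 -> -1 transition: no writers and
		 * no other deny-write holders may be present */
		if (atomic_cmpxchg(&inode->i_writecount, 0, -1) != 0)
			return -ETXTBSY;
		return 0;
	}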
2023-05-25 | workqueue: Disable per-cpu CPU hog detection when wq_cpu_intensive_thresh_us is 0 | Zqiang | 1 file, -0/+3
If workqueue.cpu_intensive_thresh_us is set to 0, the detection mechanism for CPU-hogging per-cpu work items will keep triggering spuriously:

  workqueue: process_srcu hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
  workqueue: gc_worker hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
  workqueue: gc_worker hogged CPU for >0us 8 times, consider switching to WQ_UNBOUND
  workqueue: wait_rcu_exp_gp hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
  workqueue: kfree_rcu_monitor hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
  workqueue: kfree_rcu_monitor hogged CPU for >0us 8 times, consider switching to WQ_UNBOUND
  workqueue: reg_todo hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND

This commit therefore disables the CPU-hog detection mechanism when workqueue.cpu_intensive_thresh_us is set to 0.

tj: Patch description updated and the condition check on cpu_intensive_thresh_us separated into a separate if statement for readability.

Signed-off-by: Zqiang <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
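A hedged sketch of the check added to wq_worker_tick() (kept as its own if statement, per the maintainer note above):

	/* a zero threshold now disables the CPU-hog detector outright */
	if (!wq_cpu_intensive_thresh_us)
		return;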
2023-05-25 | Merge tag 'net-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Linus Torvalds | 3 files, -4/+6
Pull networking fixes from Paolo Abeni:
 "Including fixes from bluetooth and bpf.

  Current release - regressions:
   - net: fix skb leak in __skb_tstamp_tx()
   - eth: mtk_eth_soc: fix QoS on DSA MAC on non MTK_NETSYS_V2 SoCs

  Current release - new code bugs:
   - handshake:
      - fix sock->file allocation
      - fix handshake_dup() ref counting
   - bluetooth:
      - fix potential double free caused by hci_conn_unlink
      - fix UAF in hci_conn_hash_flush

  Previous releases - regressions:
   - core: fix stack overflow when LRO is disabled for virtual interfaces
   - tls: fix strparser rx issues
   - bpf:
      - fix many sockmap/TCP related issues
      - fix a memory leak in the LRU and LRU_PERCPU hash maps
      - init the offload table earlier
   - eth: mlx5e:
      - do as little as possible in napi poll when budget is 0
      - fix using eswitch mapping in nic mode
      - fix deadlock in tc route query code

  Previous releases - always broken:
   - udplite: fix NULL pointer dereference in __sk_mem_raise_allocated()
   - raw: fix output xfrm lookup wrt protocol
   - smc: reset connection when trying to use SMCRv2 fails
   - phy: mscc: enable VSC8501/2 RGMII RX clock
   - eth: octeontx2-pf: fix TSOv6 offload
   - eth: cdc_ncm: deal with too low values of dwNtbOutMaxSize"

* tag 'net-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (79 commits)
  udplite: Fix NULL pointer dereference in __sk_mem_raise_allocated().
  net: phy: mscc: enable VSC8501/2 RGMII RX clock
  net: phy: mscc: remove unnecessary phydev locking
  net: phy: mscc: add support for VSC8501
  net: phy: mscc: add VSC8502 to MODULE_DEVICE_TABLE
  net/handshake: Enable the SNI extension to work properly
  net/handshake: Unpin sock->file if a handshake is cancelled
  net/handshake: handshake_genl_notify() shouldn't ignore @flags
  net/handshake: Fix uninitialized local variable
  net/handshake: Fix handshake_dup() ref counting
  net/handshake: Remove unneeded check from handshake_dup()
  ipv6: Fix out-of-bounds access in ipv6_find_tlv()
  net: ethernet: mtk_eth_soc: fix QoS on DSA MAC on non MTK_NETSYS_V2 SoCs
  docs: netdev: document the existence of the mail bot
  net: fix skb leak in __skb_tstamp_tx()
  r8169: Use a raw_spinlock_t for the register locks.
  page_pool: fix inconsistency for page_pool_ring_[un]lock()
  bpf, sockmap: Test progs verifier error with latest clang
  bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer with drops
  bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer
  ...
2023-05-25 | bpf: drop unnecessary bpf_capable() check in BPF_MAP_FREEZE command | Andrii Nakryiko | 1 file, -4/+5
Seems like that extra bpf_capable() check in the BPF_MAP_FREEZE handler was unintentionally left in when we switched to a model in which all BPF map operations should be allowed regardless of CAP_BPF (or any other capability), as long as the process got a BPF map FD somehow. This patch replaces the bpf_capable() check in the BPF_MAP_FREEZE handler with a writeable access check, given that conceptually freezing the map is modifying it: the map becomes unmodifiable for subsequent updates.
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
2023-05-24 | Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf | Jakub Kicinski | 3 files, -4/+6
Daniel Borkmann says:

====================
pull-request: bpf 2023-05-24

We've added 19 non-merge commits during the last 10 day(s) which contain a total of 20 files changed, 738 insertions(+), 448 deletions(-).

The main changes are:

1) Batch of BPF sockmap fixes found when running against NGINX TCP tests, from John Fastabend.

2) Fix a memleak in the LRU{,_PERCPU} hash map when bucket locking fails, from Anton Protopopov.

3) Init the BPF offload table earlier than just late_initcall, from Jakub Kicinski.

4) Fix ctx access mask generation for 32-bit narrow loads of 64-bit fields, from Will Deacon.

5) Remove a now unsupported __fallthrough in BPF samples, from Andrii Nakryiko.

6) Fix a typo in pkg-config call for building sign-file, from Jeremy Sowden.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf, sockmap: Test progs verifier error with latest clang
  bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer with drops
  bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer
  bpf, sockmap: Test shutdown() correctly exits epoll and recv()=0
  bpf, sockmap: Build helper to create connected socket pair
  bpf, sockmap: Pull socket helpers out of listen test for general use
  bpf, sockmap: Incorrectly handling copied_seq
  bpf, sockmap: Wake up polling after data copy
  bpf, sockmap: TCP data stall on recv before accept
  bpf, sockmap: Handle fin correctly
  bpf, sockmap: Improved check for empty queue
  bpf, sockmap: Reschedule is now done through backlog
  bpf, sockmap: Convert schedule_work into delayed_work
  bpf, sockmap: Pass skb ownership through read_skb
  bpf: fix a memory leak in the LRU and LRU_PERCPU hash maps
  bpf: Fix mask generation for 32-bit narrow loads of 64-bit fields
  samples/bpf: Drop unnecessary fallthrough
  bpf: netdev: init the offload table earlier
  selftests/bpf: Fix pkg-config call building sign-file
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
2023-05-24 | cgroup: Update out-of-date comment in cgroup_migrate() | Xiu Jianfeng | 1 file, -3/+3
Commit 674b745e22b3 ("cgroup: remove rcu_read_lock()/rcu_read_unlock() in critical section of spin_lock_irq()") has removed the rcu_read_lock, which makes the comment out-of-date, so update it. tj: Updated the comment a bit. Signed-off-by: Xiu Jianfeng <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2023-05-24 | workqueue: Fix WARN_ON_ONCE() triggers in worker_enter_idle() | Zqiang | 1 file, -5/+12
Currently, pool->nr_running can be modified from the timer tick, which means the timer tick can run nested inside a not-irq-protected section that's in the process of modifying nr_running. Consider the following scenario:

CPU0
kworker/0:2 (events)

   worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND);
   ->pool->nr_running++;                        (1)

   process_one_work()
   ->worker->current_func(work);
     ->schedule()
       ->wq_worker_sleeping()
         ->worker->sleeping = 1;
         ->pool->nr_running--;                  (0)
       ....
       ->wq_worker_running()
       ....

       CPU0 by interrupt:
       wq_worker_tick()
       ->worker_set_flags(worker, WORKER_CPU_INTENSIVE);
         ->pool->nr_running--;                  (-1)
         ->worker->flags |= WORKER_CPU_INTENSIVE;
       ....

       ->if (!(worker->flags & WORKER_NOT_RUNNING))
         ->pool->nr_running++;                  (will not execute)
       ->worker->sleeping = 0;
       ....
   ->worker_clr_flags(worker, WORKER_CPU_INTENSIVE);
     ->pool->nr_running++;                      (0)
   ....
   worker_set_flags(worker, WORKER_PREP);
   ->pool->nr_running--;                        (-1)
   ....
   worker_enter_idle()
   ->WARN_ON_ONCE(pool->nr_workers == pool->nr_idle && pool->nr_running);

If nr_workers is equal to nr_idle, the nonzero nr_running triggers the WARN_ON_ONCE():

[    2.460602] WARNING: CPU: 0 PID: 63 at kernel/workqueue.c:1999 worker_enter_idle+0xb2/0xc0
[    2.462163] Modules linked in:
[    2.463401] CPU: 0 PID: 63 Comm: kworker/0:2 Not tainted 6.4.0-rc2-next-20230519 #1
[    2.463771] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
[    2.465127] Workqueue:  0x0 (events)
[    2.465678] RIP: 0010:worker_enter_idle+0xb2/0xc0
...
[    2.472614] Call Trace:
[    2.473152]  <TASK>
[    2.474182]  worker_thread+0x71/0x430
[    2.474992]  ? _raw_spin_unlock_irqrestore+0x28/0x50
[    2.475263]  kthread+0x103/0x120
[    2.475493]  ? __pfx_worker_thread+0x10/0x10
[    2.476355]  ? __pfx_kthread+0x10/0x10
[    2.476635]  ret_from_fork+0x2c/0x50
[    2.477051]  </TASK>

This commit therefore adds a check of worker->sleeping in wq_worker_tick(): if worker->sleeping is not zero, return directly (see the sketch below).

tj: Updated comment and description.

Reported-by: Naresh Kamboju <[email protected]>
Reported-by: Linux Kernel Functional Testing <[email protected]>
Tested-by: Anders Roxell <[email protected]>
Closes: https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20230519/testrun/17078554/suite/boot/test/clang-nightly-lkftconfig/log
Signed-off-by: Zqiang <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
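A hedged sketch of the early return described above (surrounding accounting elided):

	void wq_worker_tick(struct task_struct *task)
	{
		struct worker *worker = kthread_data(task);

		/* wq_worker_sleeping()/wq_worker_running() are mid-update
		 * on this worker: don't touch nr_running from the tick */
		if (worker->sleeping)
			return;

		/* ... CPU-intensive detection continues here ... */
	}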
2023-05-24 | PM: hibernate: Correct spelling mistake in a comment | Wang Honghui | 1 file, -1/+1
Fix a typo in a comment in kernel/power/snapshot.c Signed-off-by: Wang Honghui <[email protected]> [ rjw: Subject and changelog edits ] Signed-off-by: Rafael J. Wysocki <[email protected]>
2023-05-24 | x86/pci/xen: populate MSI sysfs entries | Maximilian Heyne | 1 file, -2/+2
Commit bf5e758f02fc ("genirq/msi: Simplify sysfs handling") reworked the creation of sysfs entries for MSI IRQs. The creation used to be in msi_domain_alloc_irqs_descs_locked after calling ops->domain_alloc_irqs. Then it moved into __msi_domain_alloc_irqs, which is an implementation of domain_alloc_irqs. However, Xen comes with the only other implementation of domain_alloc_irqs and hence doesn't run the sysfs population code anymore. Commit 6c796996ee70 ("x86/pci/xen: Fixup fallout from the PCI/MSI overhaul") set the flag MSI_FLAG_DEV_SYSFS for the xen msi_domain_info, but that doesn't actually have an effect because Xen uses its own domain_alloc_irqs implementation. Fix this by making use of the fallback functions for sysfs population.
Fixes: bf5e758f02fc ("genirq/msi: Simplify sysfs handling")
Signed-off-by: Maximilian Heyne <[email protected]>
Reviewed-by: Juergen Gross <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Juergen Gross <[email protected]>
2023-05-24 | trace: Convert trace/seq to use copy_splice_read() | David Howells | 1 file, -1/+1
For the splice from the trace seq buffer, just use copy_splice_read(). In the future, something better can probably be done by gifting pages from seq->buf into the pipe, but that would require changing seq->buf into a vmap over an array of pages. Signed-off-by: David Howells <[email protected]> cc: Christoph Hellwig <[email protected]> cc: Al Viro <[email protected]> cc: Jens Axboe <[email protected]> cc: Steven Rostedt <[email protected]> cc: Masami Hiramatsu <[email protected]> cc: [email protected] cc: [email protected] cc: [email protected] cc: [email protected] cc: [email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
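A hedged sketch of the wiring; the fops name and the neighbouring handlers are assumptions, while copy_splice_read() is the generic helper named above:

	static const struct file_operations tracing_seq_fops = {
		.open		= tracing_open,
		.read		= seq_read,
		.llseek		= seq_lseek,
		.release	= tracing_release,
		.splice_read	= copy_splice_read,	/* replaces the bespoke splice path */
	};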
2023-05-24 | genirq: Use a maple tree for interrupt descriptor management | Shanker Donthineni | 2 files, -26/+33
The current implementation uses a static bitmap for interrupt descriptor allocation and a radix tree to store the pointers for lookup. However, the size of the bitmap is constrained by the build time macro MAX_SPARSE_IRQS, which may not be sufficient to support high-end servers, particularly those with GICv4.1 hardware, which require a large interrupt space to cover LPIs and vSGIs. Replace the bitmap and the radix tree with a maple tree, which not only stores pointers for lookup, but also provides a mechanism to find free ranges. That removes the build time hardcoded upper limit.
Signed-off-by: Shanker Donthineni <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
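A hedged sketch of the idea: one maple tree serves both lookup and free-range search. The helper names are illustrative; the calls follow the maple tree API.

	static struct maple_tree sparse_irqs =
		MTREE_INIT(sparse_irqs, MT_FLAGS_ALLOC_RANGE);

	static int irq_insert_desc(unsigned int irq, struct irq_desc *desc)
	{
		/* the tree replaces both the radix tree and the bitmap */
		return mtree_insert(&sparse_irqs, irq, desc, GFP_KERNEL);
	}

	static struct irq_desc *irq_lookup_desc(unsigned int irq)
	{
		return mtree_load(&sparse_irqs, irq);
	}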
2023-05-24 | genirq: Encapsulate sparse bitmap handling | Shanker Donthineni | 2 files, -12/+22
Move the open coded sparse bitmap handling into helper functions as a preparatory step for converting the sparse interrupt management to a maple tree. No functional change. Signed-off-by: Shanker Donthineni <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-05-24 | genirq: Use hlist for managing resend handlers | Shanker Donthineni | 4 files, -17/+35
The current implementation utilizes a bitmap for managing interrupt resend handlers, which is allocated based on the SPARSE_IRQ/NR_IRQS macros. However, this method may not efficiently utilize memory during runtime, particularly when IRQ_BITMAP_BITS is large. Address this issue by using an hlist to manage interrupt resend handlers instead of relying on a static bitmap memory allocation. Additionally, a new function, clear_irq_resend(), is introduced and called from irq_shutdown to ensure a graceful teardown of the interrupt. Signed-off-by: Shanker Donthineni <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
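A hedged sketch of the mechanism: pending resends are chained on a list instead of marked in a static bitmap. irq_mark_resend() is a hypothetical name; clear_irq_resend() is the new function named above.

	static HLIST_HEAD(irq_resend_list);

	static void irq_mark_resend(struct irq_desc *desc)
	{
		if (hlist_unhashed(&desc->resend_node))
			hlist_add_head(&desc->resend_node, &irq_resend_list);
	}

	void clear_irq_resend(struct irq_desc *desc)
	{
		/* called from irq_shutdown() for a graceful teardown */
		hlist_del_init(&desc->resend_node);
	}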
2023-05-23 | module: Remove preempt_disable() from module reference counting. | Sebastian Andrzej Siewior | 1 file, -7/+0
The preempt_disable() section in module_put() was added in commit e1783a240f491 ("module: Use this_cpu_xx to dynamically allocate counters") while the per-CPU counters were switched to another API. The API requires that during the RMW operation the CPU remains the same. This counting API was later replaced with atomic_t in commit 2f35c41f58a97 ("module: Replace module_ref with atomic_t refcnt"). Since this atomic_t replacement there is no need to keep preemption disabled while the reference counter is modified. Remove preempt_disable() from module_put(), __module_get() and try_module_get().
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Luis Chamberlain <[email protected]>
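A schematic before/after; the shape of the old per-CPU counter access (refptr/incs) is an assumption based on the API the commits above describe:

	/* before: the RMW on a per-CPU counter had to stay on one CPU */
	preempt_disable();
	__this_cpu_inc(module->refptr->incs);
	preempt_enable();

	/* after: a plain atomic_t needs no CPU pinning at all */
	atomic_inc(&module->refcnt);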
2023-05-23 | sysctl: Refactor base paths registrations | Joel Granados | 1 file, -21/+9
This is part of the general push to deprecate register_sysctl_paths and register_sysctl_table. The old way of doing this through register_sysctl_base and the DECLARE_SYSCTL_BASE macro is replaced with a call to register_sysctl_init. The 5 base paths affected are: "kernel", "vm", "debug", "dev" and "fs". We remove the register_sysctl_base function and the DECLARE_SYSCTL_BASE macro since they are no longer needed.

In order to quickly ascertain that the paths did not actually change, I executed `find /proc/sys/ | sha1sum` and made sure that the sha was the same before and after the commit.

We end up saving 563 bytes with this change:

./scripts/bloat-o-meter vmlinux.0.base vmlinux.1.refactor-base-paths
add/remove: 0/5 grow/shrink: 2/0 up/down: 77/-640 (-563)
Function                                     old     new   delta
sysctl_init_bases                             55     111     +56
init_fs_sysctls                               12      33     +21
vm_base_table                                128       -    -128
kernel_base_table                            128       -    -128
fs_base_table                                128       -    -128
dev_base_table                               128       -    -128
debug_base_table                             128       -    -128
Total: Before=21258215, After=21257652, chg -0.00%

[mcgrof: modified to use register_sysctl_init() over register_sysctl() and add bloat-o-meter stats]
Signed-off-by: Joel Granados <[email protected]>
Signed-off-by: Luis Chamberlain <[email protected]>
Tested-by: Stephen Rothwell <[email protected]>
Acked-by: Christian Brauner <[email protected]>
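A hedged sketch of the resulting registration; the table identifiers are assumptions, and "fs" is registered from its own init per the bloat-o-meter output above:

	static void __init sysctl_init_bases(void)
	{
		register_sysctl_init("kernel", kern_table);
		register_sysctl_init("vm", vm_table);
		register_sysctl_init("debug", debug_table);
		register_sysctl_init("dev", dev_table);
		/* "fs" moves into init_fs_sysctls() */
	}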
2023-05-23 | tracing: Rename stacktrace field to common_stacktrace | Steven Rostedt (Google) | 3 files, -7/+13
The histogram and synthetic events can use a pseudo event called "stacktrace" that will create a stacktrace at the time of the event and use it just like it was a normal field. We have other pseudo events such as "common_cpu" and "common_timestamp". To stay consistent with that, convert "stacktrace" to "common_stacktrace". As this was used in older kernels, to keep backward compatibility, this will act just like "common_cpu" did with "cpu". That is, "cpu" will be the same as "common_cpu" unless the event has a "cpu" field. In which case, the event's field is used. The same is true with "stacktrace". Also update the documentation to reflect this change. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Tom Zanussi <[email protected]> Cc: Mark Rutland <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-05-23 | tracing/histograms: Allow variables to have some modifiers | Steven Rostedt (Google) | 1 file, -7/+16
Modifiers are used to change the behavior of keys. For instance, they can be grouped into buckets, converted to syscall names (from the syscall identifier), show task->comm of the current pid, be an array of longs that represent a stacktrace, and more. It was found that nothing stopped a value from taking a modifier, even though values are simple counters. If this happened, it would call code that was not expecting a modifier and crash the kernel. This was fixed by having the ___create_val_field() function test whether a modifier was present and fail if one was. This fixed the crash. Now there's a problem with variables. Variables are used to pass fields from one event to another. Variables are allowed to have some modifiers, as the processing may need to happen at the time of the event (like stacktraces and comm names of the current pid). The issue is that it too uses __create_val_field(). Now that that fails on modifiers, variables can no longer use them (this is a regression). As not all modifiers are for variables, have them use a separate check.
Link: https://lore.kernel.org/linux-trace-kernel/[email protected]
Cc: [email protected]
Cc: Masami Hiramatsu <[email protected]>
Cc: Tom Zanussi <[email protected]>
Cc: Mark Rutland <[email protected]>
Fixes: e0213434fe3e4 ("tracing: Do not let histogram values have some modifiers")
Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-05-23 | tracing/user_events: Document user_event_mm one-shot list usage | Beau Belgrave | 1 file, -1/+22
During 6.4 development it became clear that the one-shot list used by the user_event_mm's next field was confusing to others. It is not clear how this list is protected or what the next field usage is for unless you are familiar with the code. Add comments into the user_event_mm struct indicating lock requirement and usage. Also document how and why this approach was used via comments in both user_event_enabler_update() and user_event_mm_get_all() and the rules to properly use it. Link: https://lkml.kernel.org/r/[email protected] Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wicngggxVpbnrYHjRTwGE0WYscPRM+L2HO2BF8ia1EXgQ@mail.gmail.com/ Suggested-by: Linus Torvalds <[email protected]> Signed-off-by: Beau Belgrave <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-05-23 | tracing/user_events: Rename link fields for clarity | Beau Belgrave | 1 file, -18/+22
Currently most list_head fields of various structs within user_events are simply named link. This causes folks to keep additional context in their head when working with the code, which can be confusing. Instead of using link, describe what the actual link is, for example:

	list_del_rcu(&mm->link);

changes into:

	list_del_rcu(&mm->mms_link);

The reader is now given a hint that the link is to the mms global list, instead of having to remember or spot-check within the code.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wicngggxVpbnrYHjRTwGE0WYscPRM+L2HO2BF8ia1EXgQ@mail.gmail.com/
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Beau Belgrave <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-05-23 | tracing/user_events: Remove RCU lock while pinning pages | Linus Torvalds | 1 file, -6/+7
pin_user_pages_remote() can reschedule, which means we cannot hold any RCU lock while using it. Now that enablers are not exposed out to the tracing register callbacks during fork(), there is clearly no need to require the RCU lock, as event_mutex is enough to protect changes. Remove unneeded RCU usages when pinning pages and walking enablers with event_mutex held. Clean up a misleading "safe" list walk that is not needed. During fork() duplication, remove the unneeded RCU list add, since the list is not exposed yet.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wiiBfT4zNS29jA0XEsy8EmbqTH1hAPdRJCDAJMD8Gxt5A@mail.gmail.com/
Fixes: 7235759084a4 ("tracing/user_events: Use remote writes for event enablement")
Signed-off-by: Linus Torvalds <[email protected]>
[ change log written by Beau Belgrave ]
Signed-off-by: Beau Belgrave <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-05-23 | tracing/user_events: Split up mm alloc and attach | Linus Torvalds | 1 file, -11/+18
When a new mm is being created in a fork() path it currently is allocated and then attached in one go. This leaves the mm exposed out to the tracing register callbacks while any parent enabler locations are copied in. This should not happen. Split up mm alloc and attach as unique operations. When duplicating enablers, first alloc, then duplicate, and only upon success, attach. This prevents any timing window outside of the event_reg mutex for enablement walking. This allows for dropping RCU requirement for enablement walking in later patches. Link: https://lkml.kernel.org/r/[email protected] Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=whTBvXJuoi_kACo3qi5WZUmRrhyA-_=rRFsycTytmB6qw@mail.gmail.com/ Signed-off-by: Linus Torvalds <[email protected]> [ change log written by Beau Belgrave ] Signed-off-by: Beau Belgrave <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
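A hedged sketch of the split; the helper names are hypothetical, and the ordering is the point: the new mm is published only after every enabler duplicated.

	/* allocate first: the new mm is not yet visible to callbacks */
	struct user_event_mm *mm = user_event_mm_alloc(t);

	if (!mm)
		return -ENOMEM;

	if (user_event_enablers_dup(old_mm, mm)) {
		user_event_mm_destroy(mm);	/* never attached, safe to free */
		return -ENOMEM;
	}

	user_event_mm_attach(mm, t);		/* publish only after success */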
2023-05-23 | bpf: Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands | Andrii Nakryiko | 2 files, -13/+28
Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall forces users to specify pinning location as a string-based absolute or relative (to current working directory) path. This has various implications related to security (e.g., symlink-based attacks), forces BPF FS to be exposed in the file system, which can cause races with other applications. One of the feedbacks we got from folks working with containers heavily was that inability to use purely FD-based location specification was an unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET commands. This patch closes this oversight, adding path_fd field to BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by *at() syscalls for dirfd + pathname combinations. This now allows interesting possibilities like working with detached BPF FS mount (e.g., to perform multiple pinnings without running a risk of someone interfering with them), and generally making pinning/getting more secure and not prone to any races and/or security attacks. This is demonstrated by a selftest added in subsequent patch that takes advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate creating detached BPF FS mount, pinning, and then getting BPF map out of it, all while never exposing this private instance of BPF FS to outside worlds. Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Reviewed-by: Christian Brauner <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
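A hedged userspace sketch of the new dirfd + pathname convention (error handling elided; BPF_F_PATH_FD is the opt-in flag added alongside the path_fd field):

	#include <linux/bpf.h>
	#include <string.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static int obj_pin_at(int path_fd, const char *name, int bpf_fd)
	{
		union bpf_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.path_fd    = path_fd;		/* e.g. an O_PATH fd of a detached BPF FS mount */
		attr.pathname   = (__u64)(unsigned long)name; /* now relative to path_fd */
		attr.bpf_fd     = bpf_fd;		/* prog/map/btf to pin */
		attr.file_flags = BPF_F_PATH_FD;	/* opt in to path_fd semantics */

		return syscall(__NR_bpf, BPF_OBJ_PIN, &attr, sizeof(attr));
	}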
2023-05-23 | cpu/hotplug: Fix off by one in cpuhp_bringup_mask() | Thomas Gleixner | 1 file, -3/+3
cpuhp_bringup_mask() iterates over a cpumask and starts all present CPUs up to a caller provided upper limit. The limit variable is decremented and checked for 0 before invoking cpu_up(), which is obviously off by one and prevents the bringup of the last CPU when the limit is equal to the number of present CPUs. Move the decrement and check after the cpu_up() invocation. Fixes: 18415f33e2ac ("cpu/hotplug: Allow "parallel" bringup up to CPUHP_BP_KICK_AP_STATE") Reported-by: Mark Brown <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Mark Brown <[email protected]> Link: https://lore.kernel.org/r/87wn10ufj9.ffs@tglx
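A minimal sketch of the bug and the fix (the loop internals are simplified; ncpus is the caller-provided limit):

	/* before (off by one): the budget is charged before the bringup,
	 * so when ncpus equals the number of present CPUs the last one
	 * is never started */
	for_each_cpu(cpu, mask) {
		if (--ncpus == 0)
			break;
		cpu_up(cpu, CPUHP_ONLINE);
	}

	/* after: bring the CPU up first, then decrement and check */
	for_each_cpu(cpu, mask) {
		cpu_up(cpu, CPUHP_ONLINE);
		if (--ncpus == 0)
			break;
	}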
2023-05-23 | tracing/timerlat: Always wakeup the timerlat thread | Daniel Bristot de Oliveira | 1 file, -0/+2
While testing rtla timerlat auto analysis, I reached a condition where the interface was not receiving tracing data. I was able to manually reproduce the problem with these steps:

  # echo 0 > tracing_on                 # disable trace
  # echo 1 > osnoise/stop_tracing_us    # stop trace if timerlat irq > 1 us
  # echo timerlat > current_tracer      # enable timerlat tracer
  # sleep 1                             # wait... that is the time when rtla
                                        # applies configs like prio or cgroup
  # echo 1 > tracing_on                 # start tracing
  # cat trace
  # tracer: timerlat
  #
  #                                _-----=> irqs-off
  #                               / _----=> need-resched
  #                              | / _---=> hardirq/softirq
  #                              || / _--=> preempt-depth
  #                              ||| / _-=> migrate-disable
  #                              |||| /     delay
  #                              |||||            ACTIVATION
  #           TASK-PID      CPU# |||||  TIMESTAMP    ID     CONTEXT    LATENCY
  #              | |         |   |||||     |         |         |          |

  NOTHING!

Then, trying to enable tracing again with echo 1 > tracing_on resulted in no change: the trace was still not tracing. This problem happens because the timerlat IRQ hits the stop tracing condition while tracing is off and does not wake up the timerlat thread, so the timerlat threads are kept sleeping forever, resulting in no trace, even after re-enabling the tracer. Avoid this condition by always waking up the threads, even after stopping tracing, allowing the tracer to return to its normal operation after a new tracing on.
Link: https://lore.kernel.org/linux-trace-kernel/1ed8f830638b20a39d535d27d908e319a9a3c4e2.1683822622.git.bristot@kernel.org
Cc: Juri Lelli <[email protected]>
Cc: [email protected]
Fixes: a955d7eac177 ("trace: Add timerlat tracer")
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-05-23 | bpf: Validate BPF object in BPF_OBJ_PIN before calling LSM | Andrii Nakryiko | 1 file, -6/+5
Do a sanity check on whether the provided file-to-be-pinned is actually a BPF object (prog, map, btf) before calling the security_path_mknod LSM hook. If it's not, the LSM hook doesn't have to be triggered, as the operation has no chance of succeeding anyway.
Suggested-by: Christian Brauner <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Christian Brauner <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
2023-05-23 | tracing/user_events: Use long vs int for atomic bit ops | Beau Belgrave | 1 file, -7/+8
Each event stores an int to track which bit to set/clear when enablement changes. On big endian 64-bit configurations, it's possible this could cause memory corruption when it's used for atomic bit operations. Use unsigned long for enablement values to ensure any possible corruption cannot occur. Downcast to int after masking for the bit target.
Link: https://lore.kernel.org/all/[email protected]/
Link: https://lore.kernel.org/linux-trace-kernel/[email protected]
Fixes: dcb8177c1395 ("tracing/user_events: Add ioctl for disabling addresses")
Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Beau Belgrave <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
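A minimal illustration of the hazard and the fix (variable names are illustrative):

	/* why int storage is dangerous: the atomic bit ops operate on an
	 * unsigned long, so on a 64-bit big-endian kernel this cast makes
	 * set_bit() address 4 bytes of neighbouring memory as well */
	int bad_bits;
	set_bit(0, (unsigned long *)&bad_bits);

	/* the fix: keep the enablement value in an unsigned long and
	 * downcast to int only after masking out the bit target */
	unsigned long good_bits;
	set_bit(0, &good_bits);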
2023-05-22 | cgroup: always put cset in cgroup_css_set_put_fork | John Sperbeck | 1 file, -9/+8
A successful call to cgroup_css_set_fork() will always have taken a ref on kargs->cset (regardless of CLONE_INTO_CGROUP), so always do a corresponding put in cgroup_css_set_put_fork(). Without this, a cset and its contained css structures will be leaked for some fork failures. The following script reproduces the leak for a fork failure due to exceeding pids.max in the pids controller. A similar thing can happen if we jump to the bad_fork_cancel_cgroup label in copy_process().

	[ -z "$1" ] && echo "Usage $0 pids-root" && exit 1
	PID_ROOT=$1
	CGROUP=$PID_ROOT/foo

	[ -e $CGROUP ] && rmdir $CGROUP
	mkdir $CGROUP
	echo 5 > $CGROUP/pids.max
	echo $$ > $CGROUP/cgroup.procs

	fork_bomb()
	{
		set -e
		for i in $(seq 10); do
			/bin/sleep 3600 &
		done
	}

	(fork_bomb) &
	wait

	echo $$ > $PID_ROOT/cgroup.procs
	kill $(cat $CGROUP/cgroup.procs)
	rmdir $CGROUP

Fixes: ef2c41cf38a7 ("clone3: allow spawning processes into cgroups")
Cc: [email protected] # v5.7+
Signed-off-by: John Sperbeck <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
2023-05-22 | module: Fix use-after-free bug in read_file_mod_stats() | Harshit Mogalapalli | 1 file, -1/+3
Smatch warns: kernel/module/stats.c:394 read_file_mod_stats() warn: passing freed memory 'buf' We are passing 'buf' to simple_read_from_buffer() after freeing it. Fix this by changing the order of 'simple_read_from_buffer' and 'kfree'. Fixes: df3e764d8e5c ("module: add debug stats to help identify memory pressure") Signed-off-by: Harshit Mogalapalli <[email protected]> Signed-off-by: Luis Chamberlain <[email protected]>
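A sketch of the corrected tail of read_file_mod_stats(), per the description above (variable names are assumptions):

	/* consume the buffer first, then free it */
	ret = simple_read_from_buffer(user_buf, count, ppos, buf, len);
	kfree(buf);
	return ret;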
2023-05-22 | cgroup: Replace all non-returning strlcpy with strscpy | Azeem Shaikh | 1 file, -2/+2
strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy(). No return values were used, so direct replacement is safe. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [2] https://github.com/KSPP/linux/issues/89 Signed-off-by: Azeem Shaikh <[email protected]> Reviewed-by: Kees Cook <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2023-05-22 | cgroup: fix missing cpus_read_{lock,unlock}() in cgroup_transfer_tasks() | Qi Zheng | 1 file, -2/+2
The commit 4f7e7236435c ("cgroup: Fix threadgroup_rwsem <-> cpus_read_lock() deadlock") fixed the deadlock between cgroup_threadgroup_rwsem and cpus_read_lock() by introducing cgroup_attach_{lock,unlock}() and removing cpus_read_{lock,unlock}() from cpuset_attach(). But cgroup_transfer_tasks() was missed and not handled, which will cause the following warning:

 WARNING: CPU: 0 PID: 589 at kernel/cpu.c:526 lockdep_assert_cpus_held+0x32/0x40
 CPU: 0 PID: 589 Comm: kworker/1:4 Not tainted 6.4.0-rc2-next-20230517 #50
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
 Workqueue: events cpuset_hotplug_workfn
 RIP: 0010:lockdep_assert_cpus_held+0x32/0x40
 <...>
 Call Trace:
  <TASK>
  cpuset_attach+0x40/0x240
  cgroup_migrate_execute+0x452/0x5e0
  ? _raw_spin_unlock_irq+0x28/0x40
  cgroup_transfer_tasks+0x1f3/0x360
  ? find_held_lock+0x32/0x90
  ? cpuset_hotplug_workfn+0xc81/0xed0
  cpuset_hotplug_workfn+0xcb1/0xed0
  ? process_one_work+0x248/0x5b0
  process_one_work+0x2b9/0x5b0
  worker_thread+0x56/0x3b0
  ? process_one_work+0x5b0/0x5b0
  kthread+0xf1/0x120
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x1f/0x30
  </TASK>

So just use the cgroup_attach_{lock,unlock}() helpers to fix it.

Reported-by: Zhao Gongyi <[email protected]>
Signed-off-by: Qi Zheng <[email protected]>
Acked-by: Muchun Song <[email protected]>
Fixes: 05c7b7a92cc8 ("cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplug")
Cc: [email protected] # v5.17+
Signed-off-by: Tejun Heo <[email protected]>
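A hedged sketch of the fix in cgroup_transfer_tasks(); the boolean argument follows the cgroup_attach_lock(bool) helper introduced by the commit cited above:

	cgroup_attach_lock(true);	/* cpus_read_lock() + threadgroup_rwsem,
					 * taken in the established order */
	/* ... migrate the tasks ... */
	cgroup_attach_unlock(true);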
2023-05-22 | capability: fix kernel-doc warnings in capability.c | Gaosheng Cui | 1 file, -0/+2
Fix all kernel-doc warnings in capability.c:

  kernel/capability.c:477: warning: Function parameter or member 'idmap' not described in 'privileged_wrt_inode_uidgid'
  kernel/capability.c:493: warning: Function parameter or member 'idmap' not described in 'capable_wrt_inode_uidgid'

Signed-off-by: Gaosheng Cui <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: Paul Moore <[email protected]>
2023-05-22 | bpf: fix a memory leak in the LRU and LRU_PERCPU hash maps | Anton Protopopov | 1 file, -2/+4
The LRU and LRU_PERCPU maps allocate a new element on update before locking the target hash table bucket. Right after that the maps try to lock the bucket. If this fails, then maps return -EBUSY to the caller without releasing the allocated element. This makes the element untracked: it doesn't belong to either of free lists, and it doesn't belong to the hash table, so can't be re-used; this eventually leads to the permanent -ENOMEM on LRU map updates, which is unexpected. Fix this by returning the element to the local free list if bucket locking fails. Fixes: 20b6cc34ea74 ("bpf: Avoid hashtab deadlock with map_locked") Signed-off-by: Anton Protopopov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
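A hedged sketch of the fix in the LRU update path; the helper names follow kernel/bpf/hashtab.c, though exact signatures may differ:

	l_new = prealloc_lru_pop(htab, key, hash);
	if (!l_new)
		return -ENOMEM;

	ret = htab_lock_bucket(htab, b, hash, &flags);
	if (ret) {
		/* give the element back to the local free list
		 * instead of leaking it on -EBUSY */
		htab_lru_push_free(htab, l_new);
		return ret;
	}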
2023-05-20 | cgroup/cpuset: remove unneeded header files | Miaohe Lin | 1 file, -21/+0
Remove some unnecessary header files. No functional change intended. Signed-off-by: Miaohe Lin <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2023-05-20 | sched/psi: Avoid resetting the min update period when it is unnecessary | Yang Yang | 1 file, -5/+10
A psi_group's poll_min_period is determined by the minimum window size of its psi_triggers when creating new triggers. While destroying a psi_trigger, there is no need to reset poll_min_period if the psi_trigger being destroyed did not have the minimum window size, since in this condition poll_min_period will remain the same as before.
Signed-off-by: Yang Yang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Suren Baghdasaryan <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
2023-05-19 | bpf: Add kfunc filter function to 'struct btf_kfunc_id_set' | Aditi Ghag | 2 files, -14/+58
This commit adds the ability to filter kfuncs to certain BPF program types. This is required to limit bpf_sock_destroy kfunc implemented in follow-up commits to programs with attach type 'BPF_TRACE_ITER'. The commit adds a callback filter to 'struct btf_kfunc_id_set'. The filter has access to the `bpf_prog` construct including its properties such as `expected_attached_type`. Signed-off-by: Aditi Ghag <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
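A hedged sketch of what such a filter can look like for the sock-destroy case described above; the identifiers are illustrative, while the mechanism (a callback in struct btf_kfunc_id_set that sees the bpf_prog) follows the commit description:

	static int tracing_iter_filter(const struct bpf_prog *prog, u32 kfunc_id)
	{
		/* hide the destroy kfunc from everything but BPF_TRACE_ITER */
		if (btf_id_set8_contains(&sock_destroy_kfunc_ids, kfunc_id) &&
		    prog->expected_attach_type != BPF_TRACE_ITER)
			return -EACCES;
		return 0;
	}

	static const struct btf_kfunc_id_set sock_destroy_kfunc_set = {
		.owner  = THIS_MODULE,
		.set    = &sock_destroy_kfunc_ids,
		.filter = tracing_iter_filter,
	};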