aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2023-06-09watchdog/hardlockup: style changes to watchdog_hardlockup_check() / ↵Douglas Anderson1-17/+15
is_hardlockup() These are tiny style changes: - Add a blank line before a "return". - Renames two globals to use the "watchdog_hardlockup" prefix. - Store processor id in "unsigned int" rather than "int". - Minor comment rewording. - Use "else" rather than extra returns since it seemed more symmetric. Link: https://lkml.kernel.org/r/20230519101840.v5.9.I818492c326b632560b09f20d2608455ecf9d3650@changeid Reviewed-by: Petr Mladek <[email protected]> Signed-off-by: Douglas Anderson <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chen-Yu Tsai <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Colin Cross <[email protected]> Cc: Daniel Thompson <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Guenter Roeck <[email protected]> Cc: Ian Rogers <[email protected]> Cc: Lecopzer Chen <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Masayoshi Mizuma <[email protected]> Cc: Matthias Kaehlcke <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Pingfan Liu <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: "Ravi V. Shankar" <[email protected]> Cc: Ricardo Neri <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Stephen Boyd <[email protected]> Cc: Sumit Garg <[email protected]> Cc: Tzung-Bi Shih <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09watchdog/hardlockup: move perf hardlockup checking/panic to common watchdog.cDouglas Anderson2-61/+74
The perf hardlockup detector works by looking at interrupt counts and seeing if they change from run to run. The interrupt counts are managed by the common watchdog code via its watchdog_timer_fn(). Currently the API between the perf detector and the common code is a function: is_hardlockup(). When the hard lockup detector sees that function return true then it handles printing out debug info and inducing a panic if necessary. Let's change the API a little bit in preparation for the buddy hardlockup detector. The buddy hardlockup detector wants to print nearly the same debug info and have nearly the same panic behavior. That means we want to move all that code to the common file. For now, the code in the common file will only be there if the perf hardlockup detector is enabled, but eventually it will be selected by a common config. Right now, this _just_ moves the code from the perf detector file to the common file and changes the names. It doesn't make the changes that the buddy hardlockup detector will need and doesn't do any style cleanups. A future patch will do cleanup to make it more obvious what changed. With the above, we no longer have any callers of is_hardlockup() outside of the "watchdog.c" file, so we can remove it from the header, make it static, and move it to the same "#ifdef" block as our new watchdog_hardlockup_check(). While doing this, it can be noted that even if no hardlockup detectors were configured the existing code used to still have the code for counting/checking "hrtimer_interrupts" even if the perf hardlockup detector wasn't configured. We didn't need to do that, so move all the "hrtimer_interrupts" counting to only be there if the perf hardlockup detector is configured as well. This change is expected to be a no-op. Link: https://lkml.kernel.org/r/20230519101840.v5.8.Id4133d3183e798122dc3b6205e7852601f289071@changeid Signed-off-by: Douglas Anderson <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chen-Yu Tsai <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Colin Cross <[email protected]> Cc: Daniel Thompson <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Guenter Roeck <[email protected]> Cc: Ian Rogers <[email protected]> Cc: Lecopzer Chen <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Masayoshi Mizuma <[email protected]> Cc: Matthias Kaehlcke <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Pingfan Liu <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: "Ravi V. Shankar" <[email protected]> Cc: Ricardo Neri <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Stephen Boyd <[email protected]> Cc: Sumit Garg <[email protected]> Cc: Tzung-Bi Shih <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09watchdog/perf: rename watchdog_hld.c to watchdog_perf.cDouglas Anderson2-2/+2
The code currently in "watchdog_hld.c" is for detecting hardlockups using perf, as evidenced by the line in the Makefile that only compiles this file if CONFIG_HARDLOCKUP_DETECTOR_PERF is defined. Rename the file to prepare for the buddy hardlockup detector, which doesn't use perf. It could be argued that the new name makes it less obvious that this is a hardlockup detector. While true, it's not hard to remember that the "perf" detector is always a hardlockup detector and it's nice not to have names that are too convoluted. Link: https://lkml.kernel.org/r/20230519101840.v5.7.Ice803cb078d0e15fb2cbf49132f096ee2bd4199d@changeid Signed-off-by: Douglas Anderson <[email protected]> Acked-by: Nicholas Piggin <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chen-Yu Tsai <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Colin Cross <[email protected]> Cc: Daniel Thompson <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Guenter Roeck <[email protected]> Cc: Ian Rogers <[email protected]> Cc: Lecopzer Chen <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Masayoshi Mizuma <[email protected]> Cc: Matthias Kaehlcke <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Pingfan Liu <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: "Ravi V. Shankar" <[email protected]> Cc: Ricardo Neri <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Stephen Boyd <[email protected]> Cc: Sumit Garg <[email protected]> Cc: Tzung-Bi Shih <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09watchdog/perf: ensure CPU-bound context when creating hardlockup detector eventPingfan Liu1-1/+7
hardlockup_detector_event_create() should create perf_event on the current CPU. Preemption could not get disabled because perf_event_create_kernel_counter() allocates memory. Instead, the CPU locality is achieved by processing the code in a per-CPU bound kthread. Add a check to prevent mistakes when calling the code in another code path. Link: https://lkml.kernel.org/r/20230519101840.v5.5.I654063e53782b11d53e736a8ad4897ffd207406a@changeid Signed-off-by: Pingfan Liu <[email protected]> Co-developed-by: Lecopzer Chen <[email protected]> Signed-off-by: Lecopzer Chen <[email protected]> Signed-off-by: Douglas Anderson <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chen-Yu Tsai <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Colin Cross <[email protected]> Cc: Daniel Thompson <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Guenter Roeck <[email protected]> Cc: Ian Rogers <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Masayoshi Mizuma <[email protected]> Cc: Matthias Kaehlcke <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: "Ravi V. Shankar" <[email protected]> Cc: Ricardo Neri <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Stephen Boyd <[email protected]> Cc: Sumit Garg <[email protected]> Cc: Tzung-Bi Shih <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09watchdog/hardlockup: change watchdog_nmi_enable() to voidLecopzer Chen1-2/+1
Nobody cares about the return value of watchdog_nmi_enable(), changing its prototype to void. Link: https://lkml.kernel.org/r/20230519101840.v5.4.Ic3a19b592eb1ac4c6f6eade44ffd943e8637b6e5@changeid Signed-off-by: Pingfan Liu <[email protected]> Signed-off-by: Lecopzer Chen <[email protected]> Signed-off-by: Douglas Anderson <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Acked-by: Nicholas Piggin <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chen-Yu Tsai <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Colin Cross <[email protected]> Cc: Daniel Thompson <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Guenter Roeck <[email protected]> Cc: Ian Rogers <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Masayoshi Mizuma <[email protected]> Cc: Matthias Kaehlcke <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: "Ravi V. Shankar" <[email protected]> Cc: Ricardo Neri <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Stephen Boyd <[email protected]> Cc: Sumit Garg <[email protected]> Cc: Tzung-Bi Shih <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09watchdog: remove WATCHDOG_DEFAULTLecopzer Chen1-2/+0
No reference to WATCHDOG_DEFAULT, remove it. Link: https://lkml.kernel.org/r/20230519101840.v5.3.I6a729209a1320e0ad212176e250ff945b8f91b2a@changeid Signed-off-by: Pingfan Liu <[email protected]> Signed-off-by: Lecopzer Chen <[email protected]> Signed-off-by: Douglas Anderson <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chen-Yu Tsai <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Colin Cross <[email protected]> Cc: Daniel Thompson <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Guenter Roeck <[email protected]> Cc: Ian Rogers <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Masayoshi Mizuma <[email protected]> Cc: Matthias Kaehlcke <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: "Ravi V. Shankar" <[email protected]> Cc: Ricardo Neri <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Stephen Boyd <[email protected]> Cc: Sumit Garg <[email protected]> Cc: Tzung-Bi Shih <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09watchdog/perf: more properly prevent false positives with turbo modesDouglas Anderson1-3/+3
Currently, in the watchdog_overflow_callback() we first check to see if the watchdog had been touched and _then_ we handle the workaround for turbo mode. This order should be reversed. Specifically, "touching" the hardlockup detector's watchdog should avoid lockups being detected for one period that should be roughly the same regardless of whether we're running turbo or not. That means that we should do the extra accounting for turbo _before_ we look at (and clear) the global indicating that we've been touched. NOTE: this fix is made based on code inspection. I am not aware of any reports where the old code would have generated false positives. That being said, this order seems more correct and also makes it easier down the line to share code with the "buddy" hardlockup detector. Link: https://lkml.kernel.org/r/20230519101840.v5.2.I843b0d1de3e096ba111a179f3adb16d576bef5c7@changeid Fixes: 7edaeb6841df ("kernel/watchdog: Prevent false positives with turbo modes") Signed-off-by: Douglas Anderson <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chen-Yu Tsai <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Colin Cross <[email protected]> Cc: Daniel Thompson <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Guenter Roeck <[email protected]> Cc: Ian Rogers <[email protected]> Cc: Lecopzer Chen <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Masayoshi Mizuma <[email protected]> Cc: Matthias Kaehlcke <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Petr Mladek <[email protected]> Cc: Pingfan Liu <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: "Ravi V. Shankar" <[email protected]> Cc: Ricardo Neri <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Stephen Boyd <[email protected]> Cc: Sumit Garg <[email protected]> Cc: Tzung-Bi Shih <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09kcov: add prototypes for helper functionsArnd Bergmann1-3/+4
A number of internal functions in kcov are only called from generated code and don't technically need a declaration, but 'make W=1' warns about global symbols without a prototype: kernel/kcov.c:199:14: error: no previous prototype for '__sanitizer_cov_trace_pc' [-Werror=missing-prototypes] kernel/kcov.c:264:14: error: no previous prototype for '__sanitizer_cov_trace_cmp1' [-Werror=missing-prototypes] kernel/kcov.c:270:14: error: no previous prototype for '__sanitizer_cov_trace_cmp2' [-Werror=missing-prototypes] kernel/kcov.c:276:14: error: no previous prototype for '__sanitizer_cov_trace_cmp4' [-Werror=missing-prototypes] kernel/kcov.c:282:14: error: no previous prototype for '__sanitizer_cov_trace_cmp8' [-Werror=missing-prototypes] kernel/kcov.c:288:14: error: no previous prototype for '__sanitizer_cov_trace_const_cmp1' [-Werror=missing-prototypes] kernel/kcov.c:295:14: error: no previous prototype for '__sanitizer_cov_trace_const_cmp2' [-Werror=missing-prototypes] kernel/kcov.c:302:14: error: no previous prototype for '__sanitizer_cov_trace_const_cmp4' [-Werror=missing-prototypes] kernel/kcov.c:309:14: error: no previous prototype for '__sanitizer_cov_trace_const_cmp8' [-Werror=missing-prototypes] kernel/kcov.c:316:14: error: no previous prototype for '__sanitizer_cov_trace_switch' [-Werror=missing-prototypes] Adding prototypes for these in a header solves that problem, but now there is a mismatch between the built-in type and the prototype on 64-bit architectures because they expect some functions to take a 64-bit 'unsigned long' argument rather than an 'unsigned long long' u64 type: include/linux/kcov.h:84:6: error: conflicting types for built-in function '__sanitizer_cov_trace_switch'; expected 'void(long long unsigned int, void *)' [-Werror=builtin-declaration-mismatch] 84 | void __sanitizer_cov_trace_switch(u64 val, u64 *cases); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~ Avoid this as well with a custom type definition. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Arnd Bergmann <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Rong Tao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09panic: hide unused global functionsArnd Bergmann1-2/+1
Building with W=1 shows warnings about two functions that have no declaration or caller in certain configurations: kernel/panic.c:688:6: error: no previous prototype for 'warn_slowpath_fmt' [-Werror=missing-prototypes] kernel/panic.c:710:6: error: no previous prototype for '__warn_printk' [-Werror=missing-prototypes] Enclose the definition in the same #ifdef check as the declaration. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Arnd Bergmann <[email protected]> Cc: Boqun Feng <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Eric Paris <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Simek <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Moore <[email protected]> Cc: Pavel Machek <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Russell King <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Waiman Long <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09locking: add lockevent_read() prototypeArnd Bergmann1-0/+4
lockevent_read() has a __weak definition and the only caller in kernel/locking/lock_events.c, plus a strong definition in qspinlock_stat.h that overrides it, but no other declaration. This causes a W=1 warning: kernel/locking/lock_events.c:61:16: error: no previous prototype for 'lockevent_read' [-Werror=missing-prototypes] Add shared prototype to avoid the warnings. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Arnd Bergmann <[email protected]> Cc: Boqun Feng <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Eric Paris <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Simek <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Moore <[email protected]> Cc: Pavel Machek <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Russell King <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Waiman Long <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09fork: optimize memcg_charge_kernel_stack() a bitHaifeng Xu1-7/+3
Since commit f1c1a9ee00e4 ("fork: Move memcg_charge_kernel_stack() into CONFIG_VMAP_STACK"), memcg_charge_kernel_stack() has been moved into CONFIG_VMAP_STACK block, so the CONFIG_VMAP_STACK check can be removed. Furthermore, memcg_charge_kernel_stack() is only invoked by alloc_thread_stack_node() instead of dup_task_struct(). If memcg_kmem_charge_page() fails, the uncharge process is handled in memcg_charge_kernel_stack() itself instead of free_thread_stack(), so remove the incorrect comments. If memcg_charge_kernel_stack() fails to charge pages used by kernel stack, only charged pages need to be uncharged. It's unnecessary to uncharge those pages which memory cgroup pointer is NULL. [[email protected]: remove assertion that PAGE_SIZE is a multiple of 1k] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Haifeng Xu <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Daniel Bristot de Oliveira <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09kthread: fix spelling typo and grammar in commentsPrathu Baronia1-2/+2
- `If present` -> `If present,' - `reuturn` -> `return` - `function exit safely` -> `function to exit safely` Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Prathu Baronia <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Jason A. Donenfeld <[email protected]> Cc: Kees Cook <[email protected]> Cc: Petr Mladek <[email protected]> Cc: Sami Tolvanen <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Zqiang <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09mm/gup: remove vmas parameter from get_user_pages_remote()Lorenzo Stoakes1-8/+5
The only instances of get_user_pages_remote() invocations which used the vmas parameter were for a single page which can instead simply look up the VMA directly. In particular:- - __update_ref_ctr() looked up the VMA but did nothing with it so we simply remove it. - __access_remote_vm() was already using vma_lookup() when the original lookup failed so by doing the lookup directly this also de-duplicates the code. We are able to perform these VMA operations as we already hold the mmap_lock in order to be able to call get_user_pages_remote(). As part of this work we add get_user_page_vma_remote() which abstracts the VMA lookup, error handling and decrementing the page reference count should the VMA lookup fail. This forms part of a broader set of patches intended to eliminate the vmas parameter altogether. [[email protected]: avoid passing NULL to PTR_ERR] Link: https://lkml.kernel.org/r/d20128c849ecdbf4dd01cc828fcec32127ed939a.1684350871.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Catalin Marinas <[email protected]> (for arm64) Acked-by: David Hildenbrand <[email protected]> Reviewed-by: Janosch Frank <[email protected]> (for s390) Reviewed-by: Christoph Hellwig <[email protected]> Cc: Christian König <[email protected]> Cc: Dennis Dalessandro <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Jarkko Sakkinen <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Sakari Ailus <[email protected]> Cc: Sean Christopherson <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09mm/gup: remove unused vmas parameter from pin_user_pages_remote()Lorenzo Stoakes1-1/+1
No invocation of pin_user_pages_remote() uses the vmas parameter, so remove it. This forms part of a larger patch set eliminating the use of the vmas parameters altogether. Link: https://lkml.kernel.org/r/28f000beb81e45bf538a2aaa77c90f5482b67a32.1684350871.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <[email protected]> Acked-by: David Hildenbrand <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian König <[email protected]> Cc: Dennis Dalessandro <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Janosch Frank <[email protected]> Cc: Jarkko Sakkinen <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Sakari Ailus <[email protected]> Cc: Sean Christopherson <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09mm: page_alloc: move sysctls into it own filsKefeng Wang1-67/+0
This moves all page alloc related sysctls to its own file, as part of the kernel/sysctl.c spring cleaning, also move some functions declarations from mm.h into internal.h. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Iurii Zaikin <[email protected]> Cc: Kees Cook <[email protected]> Cc: Len Brown <[email protected]> Cc: Luis Chamberlain <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Pavel Machek <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09mm: page_alloc: move pm_* function into powerKefeng Wang2-0/+32
pm_restrict_gfp_mask()/pm_restore_gfp_mask() only used in power, let's move them out of page_alloc.c. Adding a general gfp_has_io_fs() function which return true if gfp with both __GFP_IO and __GFP_FS flags, then use it inside of pm_suspended_storage(), also the pm_suspended_storage() is moved into suspend.h. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Iurii Zaikin <[email protected]> Cc: Kees Cook <[email protected]> Cc: Len Brown <[email protected]> Cc: Luis Chamberlain <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Pavel Machek <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09mm: page_alloc: move mark_free_page() into snapshot.cKefeng Wang1-0/+52
The mark_free_page() is only used in kernel/power/snapshot.c, move it out to reduce a bit of page_alloc.c Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Iurii Zaikin <[email protected]> Cc: Kees Cook <[email protected]> Cc: Len Brown <[email protected]> Cc: Luis Chamberlain <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Pavel Machek <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09cachestat: implement cachestat syscallNhat Pham1-0/+1
There is currently no good way to query the page cache state of large file sets and directory trees. There is mincore(), but it scales poorly: the kernel writes out a lot of bitmap data that userspace has to aggregate, when the user really doesn not care about per-page information in that case. The user also needs to mmap and unmap each file as it goes along, which can be quite slow as well. Some use cases where this information could come in handy: * Allowing database to decide whether to perform an index scan or direct table queries based on the in-memory cache state of the index. * Visibility into the writeback algorithm, for performance issues diagnostic. * Workload-aware writeback pacing: estimating IO fulfilled by page cache (and IO to be done) within a range of a file, allowing for more frequent syncing when and where there is IO capacity, and batching when there is not. * Computing memory usage of large files/directory trees, analogous to the du tool for disk usage. More information about these use cases could be found in the following thread: https://lore.kernel.org/lkml/[email protected]/ This patch implements a new syscall that queries cache state of a file and summarizes the number of cached pages, number of dirty pages, number of pages marked for writeback, number of (recently) evicted pages, etc. in a given range. Currently, the syscall is only wired in for x86 architecture. NAME cachestat - query the page cache statistics of a file. SYNOPSIS #include <sys/mman.h> struct cachestat_range { __u64 off; __u64 len; }; struct cachestat { __u64 nr_cache; __u64 nr_dirty; __u64 nr_writeback; __u64 nr_evicted; __u64 nr_recently_evicted; }; int cachestat(unsigned int fd, struct cachestat_range *cstat_range, struct cachestat *cstat, unsigned int flags); DESCRIPTION cachestat() queries the number of cached pages, number of dirty pages, number of pages marked for writeback, number of evicted pages, number of recently evicted pages, in the bytes range given by `off` and `len`. An evicted page is a page that is previously in the page cache but has been evicted since. A page is recently evicted if its last eviction was recent enough that its reentry to the cache would indicate that it is actively being used by the system, and that there is memory pressure on the system. These values are returned in a cachestat struct, whose address is given by the `cstat` argument. The `off` and `len` arguments must be non-negative integers. If `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` == 0, we will query in the range from `off` to the end of the file. The `flags` argument is unused for now, but is included for future extensibility. User should pass 0 (i.e no flag specified). Currently, hugetlbfs is not supported. Because the status of a page can change after cachestat() checks it but before it returns to the application, the returned values may contain stale information. RETURN VALUE On success, cachestat returns 0. On error, -1 is returned, and errno is set to indicate the error. ERRORS EFAULT cstat or cstat_args points to an invalid address. EINVAL invalid flags. EBADF invalid file descriptor. EOPNOTSUPP file descriptor is of a hugetlbfs file [[email protected]: replace rounddown logic with the existing helper] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nhat Pham <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Brian Foster <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michael Kerrisk <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09cgroup: remove cgroup_rstat_flush_atomic()Yosry Ahmed1-21/+5
Previous patches removed the only caller of cgroup_rstat_flush_atomic(). Remove the function and simplify the code. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yosry Ahmed <[email protected]> Acked-by: Shakeel Butt <[email protected]> Acked-by: Tejun Heo <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Koutný <[email protected]> Cc: Muchun Song <[email protected]> Cc: Roman Gushchin <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds1-7/+9
Pull virtio bug fixes from Michael Tsirkin: "A bunch of fixes all over the place" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: tools/virtio: use canonical ftrace path vhost_vdpa: support PACKED when setting-getting vring_base vhost: support PACKED when setting-getting vring_base vhost: Fix worker hangs due to missed wake up calls vhost: Fix crash during early vhost_transport_send_pkt calls vhost_net: revert upend_idx only on retriable error vhost_vdpa: tell vqs about the negotiated vdpa/mlx5: Fix hang when cvq commands are triggered during device unregister tools/virtio: Add .gitignore for ringtest tools/virtio: Fix arm64 ringtest compilation error vduse: avoid empty string for dev name vhost: use kzalloc() instead of kmalloc() followed by memset()
2023-06-09kcsan: Don't expect 64 bits atomic builtins from 32 bits architecturesChristophe Leroy1-0/+2
Activating KCSAN on a 32 bits architecture leads to the following link-time failure: LD .tmp_vmlinux.kallsyms1 powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_load': kernel/kcsan/core.c:1273: undefined reference to `__atomic_load_8' powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_store': kernel/kcsan/core.c:1273: undefined reference to `__atomic_store_8' powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_exchange': kernel/kcsan/core.c:1273: undefined reference to `__atomic_exchange_8' powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_fetch_add': kernel/kcsan/core.c:1273: undefined reference to `__atomic_fetch_add_8' powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_fetch_sub': kernel/kcsan/core.c:1273: undefined reference to `__atomic_fetch_sub_8' powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_fetch_and': kernel/kcsan/core.c:1273: undefined reference to `__atomic_fetch_and_8' powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_fetch_or': kernel/kcsan/core.c:1273: undefined reference to `__atomic_fetch_or_8' powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_fetch_xor': kernel/kcsan/core.c:1273: undefined reference to `__atomic_fetch_xor_8' powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_fetch_nand': kernel/kcsan/core.c:1273: undefined reference to `__atomic_fetch_nand_8' powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_compare_exchange_strong': kernel/kcsan/core.c:1273: undefined reference to `__atomic_compare_exchange_8' powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_compare_exchange_weak': kernel/kcsan/core.c:1273: undefined reference to `__atomic_compare_exchange_8' powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_compare_exchange_val': kernel/kcsan/core.c:1273: undefined reference to `__atomic_compare_exchange_8' 32 bits architectures don't have 64 bits atomic builtins. Only include DEFINE_TSAN_ATOMIC_OPS(64) on 64 bits architectures. Fixes: 0f8ad5f2e934 ("kcsan: Add support for atomic builtins") Suggested-by: Marco Elver <[email protected]> Signed-off-by: Christophe Leroy <[email protected]> Reviewed-by: Marco Elver <[email protected]> Acked-by: Marco Elver <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/d9c6afc28d0855240171a4e0ad9ffcdb9d07fceb.1683892665.git.christophe.leroy@csgroup.eu
2023-06-08Merge tag 'cgroup-for-6.4-rc5-fixes' of ↵Linus Torvalds2-11/+10
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - Fix css_set reference leaks on fork failures - Fix CPU hotplug locking in cgroup_transfer_tasks() which is used by cgroup1 cpuset - Doc update * tag 'cgroup-for-6.4-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Documentation: Clarify usage of memory limits cgroup: always put cset in cgroup_css_set_put_fork cgroup: fix missing cpus_read_{lock,unlock}() in cgroup_transfer_tasks()
2023-06-08sysctl: move security keys sysctl registration to its own fileLuis Chamberlain1-4/+0
The security keys sysctls are already declared on its own file, just move the sysctl registration to its own file to help avoid merge conflicts on sysctls.c, and help with clearing up sysctl.c further. This creates a small penalty of 23 bytes: ./scripts/bloat-o-meter vmlinux.1 vmlinux.2 add/remove: 2/0 grow/shrink: 0/1 up/down: 49/-26 (23) Function old new delta init_security_keys_sysctls - 33 +33 __pfx_init_security_keys_sysctls - 16 +16 sysctl_init_bases 85 59 -26 Total: Before=21256937, After=21256960, chg +0.00% But soon we'll be saving tons of bytes anyway, as we modify the sysctl registrations to use ARRAY_SIZE and so we get rid of all the empty array elements so let's just clean this up now. Reviewed-by: Paul Moore <[email protected]> Acked-by: Jarkko Sakkinen <[email protected]> Acked-by: David Howells <[email protected]> Signed-off-by: Luis Chamberlain <[email protected]>
2023-06-08sysctl: move umh sysctl registration to its own fileLuis Chamberlain2-2/+10
Move the umh sysctl registration to its own file, the array is already there. We do this to remove the clutter out of kernel/sysctl.c to avoid merge conflicts. This also lets the sysctls not be built at all now when CONFIG_SYSCTL is not enabled. This has a small penalty of 23 bytes but soon we'll be removing all the empty entries on sysctl arrays so just do this cleanup now: ./scripts/bloat-o-meter vmlinux.base vmlinux.1 add/remove: 2/0 grow/shrink: 0/1 up/down: 49/-26 (23) Function old new delta init_umh_sysctls - 33 +33 __pfx_init_umh_sysctls - 16 +16 sysctl_init_bases 111 85 -26 Total: Before=21256914, After=21256937, chg +0.00% Acked-by: Jarkko Sakkinen <[email protected]> Signed-off-by: Luis Chamberlain <[email protected]>
2023-06-08vhost: Fix worker hangs due to missed wake up callsMike Christie1-7/+9
We can race where we have added work to the work_list, but vhost_task_fn has passed that check but not yet set us into TASK_INTERRUPTIBLE. wake_up_process will see us in TASK_RUNNING and just return. This bug was intoduced in commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression") when I moved the setting of TASK_INTERRUPTIBLE to simplfy the code and avoid get_signal from logging warnings about being in the wrong state. This moves the setting of TASK_INTERRUPTIBLE back to before we test if we need to stop the task to avoid a possible race there as well. We then have vhost_worker set TASK_RUNNING if it finds work similar to before. Fixes: f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression") Signed-off-by: Mike Christie <[email protected]> Message-Id: <[email protected]> Signed-off-by: Michael S. Tsirkin <[email protected]>
2023-06-08kallsyms: make kallsyms_show_value() as generic functionManinder Singh1-2/+0
This change makes function kallsyms_show_value() as generic function without dependency on CONFIG_KALLSYMS. Now module address will be displayed with lsmod and /proc/modules. Earlier: ======= / # insmod test.ko / # lsmod test 12288 0 - Live 0x0000000000000000 (O) // No Module Load address / # With change: ========== / # insmod test.ko / # lsmod test 12288 0 - Live 0xffff800000fc0000 (O) // Module address / # cat /proc/modules test 12288 0 - Live 0xffff800000fc0000 (O) Co-developed-by: Onkarnath <[email protected]> Signed-off-by: Onkarnath <[email protected]> Signed-off-by: Maninder Singh <[email protected]> Reviewed-by: Zhen Lei <[email protected]> Signed-off-by: Luis Chamberlain <[email protected]>
2023-06-08kallsyms: move kallsyms_show_value() out of kallsyms.cManinder Singh3-36/+46
function kallsyms_show_value() is used by other parts like modules_open(), kprobes_read() etc. which can work in case of !KALLSYMS also. e.g. as of now lsmod do not show module address if KALLSYMS is disabled. since kallsyms_show_value() defination is not present, it returns false in !KALLSYMS. / # lsmod test 12288 0 - Live 0x0000000000000000 (O) So kallsyms_show_value() can be made generic without dependency on KALLSYMS. Thus moving out function to a new file ksyms_common.c. With this patch code is just moved to new file and no functional change. Co-developed-by: Onkarnath <[email protected]> Signed-off-by: Onkarnath <[email protected]> Signed-off-by: Maninder Singh <[email protected]> Reviewed-by: Zhen Lei <[email protected]> Signed-off-by: Luis Chamberlain <[email protected]>
2023-06-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski6-6/+29
Cross-merge networking fixes after downstream PR. Conflicts: net/sched/sch_taprio.c d636fc5dd692 ("net: sched: add rcu annotations around qdisc->qdisc_sleeping") dced11ef84fb ("net/sched: taprio: don't overwrite "sch" variable in taprio_dump_class_stats()") net/ipv4/sysctl_net_ipv4.c e209fee4118f ("net/ipv4: ping_group_range: allow GID from 2147483648 to 4294967294") ccce324dabfe ("tcp: make the first N SYN RTO backoffs linear") https://lore.kernel.org/all/[email protected]/ No adjacent changes. Signed-off-by: Jakub Kicinski <[email protected]>
2023-06-08Merge tag 'net-6.4-rc6' of ↵Linus Torvalds4-4/+27
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from can, wifi, netfilter, bluetooth and ebpf. Current release - regressions: - bpf: sockmap: avoid potential NULL dereference in sk_psock_verdict_data_ready() - wifi: iwlwifi: fix -Warray-bounds bug in iwl_mvm_wait_d3_notif() - phylink: actually fix ksettings_set() ethtool call - eth: dwmac-qcom-ethqos: fix a regression on EMAC < 3 Current release - new code bugs: - wifi: mt76: fix possible NULL pointer dereference in mt7996_mac_write_txwi() Previous releases - regressions: - netfilter: fix NULL pointer dereference in nf_confirm_cthelper - wifi: rtw88/rtw89: correct PS calculation for SUPPORTS_DYNAMIC_PS - openvswitch: fix upcall counter access before allocation - bluetooth: - fix use-after-free in hci_remove_ltk/hci_remove_irk - fix l2cap_disconnect_req deadlock - nic: bnxt_en: prevent kernel panic when receiving unexpected PHC_UPDATE event Previous releases - always broken: - core: annotate rfs lockless accesses - sched: fq_pie: ensure reasonable TCA_FQ_PIE_QUANTUM values - netfilter: add null check for nla_nest_start_noflag() in nft_dump_basechain_hook() - bpf: fix UAF in task local storage - ipv4: ping_group_range: allow GID from 2147483648 to 4294967294 - ipv6: rpl: fix route of death. - tcp: gso: really support BIG TCP - mptcp: fixes for user-space PM address advertisement - smc: avoid to access invalid RMBs' MRs in SMCRv1 ADD LINK CONT - can: avoid possible use-after-free when j1939_can_rx_register fails - batman-adv: fix UaF while rescheduling delayed work - eth: qede: fix scheduling while atomic - eth: ice: make writes to /dev/gnssX synchronous" * tag 'net-6.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (83 commits) bnxt_en: Implement .set_port / .unset_port UDP tunnel callbacks bnxt_en: Prevent kernel panic when receiving unexpected PHC_UPDATE event bnxt_en: Skip firmware fatal error recovery if chip is not accessible bnxt_en: Query default VLAN before VNIC setup on a VF bnxt_en: Don't issue AP reset during ethtool's reset operation bnxt_en: Fix bnxt_hwrm_update_rss_hash_cfg() net: bcmgenet: Fix EEE implementation eth: ixgbe: fix the wake condition eth: bnxt: fix the wake condition lib: cpu_rmap: Fix potential use-after-free in irq_cpu_rmap_release() bpf: Add extra path pointer check to d_path helper net: sched: fix possible refcount leak in tc_chain_tmplt_add() net: sched: act_police: fix sparse errors in tcf_police_dump() net: openvswitch: fix upcall counter access before allocation net: sched: move rtm_tca_policy declaration to include file ice: make writes to /dev/gnssX synchronous net: sched: add rcu annotations around qdisc->qdisc_sleeping rfs: annotate lockless accesses to RFS sock flow table rfs: annotate lockless accesses to sk->sk_rxhash virtio_net: use control_buf for coalesce params ...
2023-06-08riscv: Add prctl controls for userspace vector managementAndy Chiu1-0/+12
This patch add two riscv-specific prctls, to allow usespace control the use of vector unit: * PR_RISCV_V_SET_CONTROL: control the permission to use Vector at next, or all following execve for a thread. Turning off a thread's Vector live is not possible since libraries may have registered ifunc that may execute Vector instructions. * PR_RISCV_V_GET_CONTROL: get the same permission setting for the current thread, and the setting for following execve(s). Signed-off-by: Andy Chiu <[email protected]> Reviewed-by: Greentime Hu <[email protected]> Reviewed-by: Vincent Chen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Palmer Dabbelt <[email protected]>
2023-06-08bpf: Fix verifier id tracking of scalars on spillMaxim Mikityanskiy1-0/+3
The following scenario describes a bug in the verifier where it incorrectly concludes about equivalent scalar IDs which could lead to verifier bypass in privileged mode: 1. Prepare a 32-bit rogue number. 2. Put the rogue number into the upper half of a 64-bit register, and roll a random (unknown to the verifier) bit in the lower half. The rest of the bits should be zero (although variations are possible). 3. Assign an ID to the register by MOVing it to another arbitrary register. 4. Perform a 32-bit spill of the register, then perform a 32-bit fill to another register. Due to a bug in the verifier, the ID will be preserved, although the new register will contain only the lower 32 bits, i.e. all zeros except one random bit. At this point there are two registers with different values but the same ID, which means the integrity of the verifier state has been corrupted. 5. Compare the new 32-bit register with 0. In the branch where it's equal to 0, the verifier will believe that the original 64-bit register is also 0, because it has the same ID, but its actual value still contains the rogue number in the upper half. Some optimizations of the verifier prevent the actual bypass, so extra care is needed: the comparison must be between two registers, and both branches must be reachable (this is why one random bit is needed). Both branches are still suitable for the bypass. 6. Right shift the original register by 32 bits to pop the rogue number. 7. Use the rogue number as an offset with any pointer. The verifier will believe that the offset is 0, while in reality it's the given number. The fix is similar to the 32-bit BPF_MOV handling in check_alu_op for SCALAR_VALUE. If the spill is narrowing the actual register value, don't keep the ID, make sure it's reset to 0. Fixes: 354e8f1970f8 ("bpf: Support <8-byte scalar spill and refill") Signed-off-by: Maxim Mikityanskiy <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Tested-by: Andrii Nakryiko <[email protected]> # Checked veristat delta Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-06-07Merge tag 'for-netdev' of ↵Jakub Kicinski4-4/+27
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2023-06-07 We've added 7 non-merge commits during the last 7 day(s) which contain a total of 12 files changed, 112 insertions(+), 7 deletions(-). The main changes are: 1) Fix a use-after-free in BPF's task local storage, from KP Singh. 2) Make struct path handling more robust in bpf_d_path, from Jiri Olsa. 3) Fix a syzbot NULL-pointer dereference in sockmap, from Eric Dumazet. 4) UAPI fix for BPF_NETFILTER before final kernel ships, from Florian Westphal. 5) Fix map-in-map array_map_gen_lookup code generation where elem_size was not being set for inner maps, from Rhys Rustad-Elliott. 6) Fix sockopt_sk selftest's NETLINK_LIST_MEMBERSHIPS assertion, from Yonghong Song. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Add extra path pointer check to d_path helper selftests/bpf: Fix sockopt_sk selftest bpf: netfilter: Add BPF_NETFILTER bpf_attach_type selftests/bpf: Add access_inner_map selftest bpf: Fix elem_size not being set for inner maps bpf: Fix UAF in task local storage bpf, sockmap: Avoid potential NULL dereference in sk_psock_verdict_data_ready() ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-06-07Merge branches 'doc.2023.05.10a', 'fixes.2023.05.11a', 'kvfree.2023.05.10a', ↵Paul E. McKenney9-180/+295
'nocb.2023.05.11a', 'rcu-tasks.2023.05.10a', 'torture.2023.05.15a' and 'rcu-urgent.2023.06.06a' into HEAD doc.2023.05.10a: Documentation updates fixes.2023.05.11a: Miscellaneous fixes kvfree.2023.05.10a: kvfree_rcu updates nocb.2023.05.11a: Callback-offloading updates rcu-tasks.2023.05.10a: Tasks RCU updates torture.2023.05.15a: Torture-test updates rcu-urgent.2023.06.06a: Urgent SRCU fix
2023-06-07swiotlb: use the atomic counter of total used slabs if availablePetr Tesarik1-0/+11
If DEBUG_FS is enabled, the cost of keeping an exact number of total used slabs is already paid. In this case, there is no reason to use an inexact number for statistics and kernel messages. Signed-off-by: Petr Tesarik <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2023-06-07dma-remap: use kvmalloc_array/kvfree for larger dma memory remapgaoxu1-2/+2
If dma_direct_alloc() alloc memory in size of 64MB, the inner function dma_common_contiguous_remap() will allocate 128KB memory by invoking the function kmalloc_array(). and the kmalloc_array seems to fail to try to allocate 128KB mem. Call trace: [14977.928623] qcrosvm: page allocation failure: order:5, mode:0x40cc0 [14977.928638] dump_backtrace.cfi_jt+0x0/0x8 [14977.928647] dump_stack_lvl+0x80/0xb8 [14977.928652] warn_alloc+0x164/0x200 [14977.928657] __alloc_pages_slowpath+0x9f0/0xb4c [14977.928660] __alloc_pages+0x21c/0x39c [14977.928662] kmalloc_order+0x48/0x108 [14977.928666] kmalloc_order_trace+0x34/0x154 [14977.928668] __kmalloc+0x548/0x7e4 [14977.928673] dma_direct_alloc+0x11c/0x4f8 [14977.928678] dma_alloc_attrs+0xf4/0x138 [14977.928680] gh_vm_ioctl_set_fw_name+0x3c4/0x610 [gunyah] [14977.928698] gh_vm_ioctl+0x90/0x14c [gunyah] [14977.928705] __arm64_sys_ioctl+0x184/0x210 work around by doing kvmalloc_array instead. Signed-off-by: Gao Xu <[email protected]> Reviewed-by: Suren Baghdasaryan <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2023-06-07dma-mapping: fix a Kconfig typoSui Jingfeng1-1/+1
'thing' -> 'think' Signed-off-by: Sui Jingfeng <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2023-06-07bpf: Add extra path pointer check to d_path helperJiri Olsa1-1/+11
Anastasios reported crash on stable 5.15 kernel with following BPF attached to lsm hook: SEC("lsm.s/bprm_creds_for_exec") int BPF_PROG(bprm_creds_for_exec, struct linux_binprm *bprm) { struct path *path = &bprm->executable->f_path; char p[128] = { 0 }; bpf_d_path(path, p, 128); return 0; } But bprm->executable can be NULL, so bpf_d_path call will crash: BUG: kernel NULL pointer dereference, address: 0000000000000018 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC NOPTI ... RIP: 0010:d_path+0x22/0x280 ... Call Trace: <TASK> bpf_d_path+0x21/0x60 bpf_prog_db9cf176e84498d9_bprm_creds_for_exec+0x94/0x99 bpf_trampoline_6442506293_0+0x55/0x1000 bpf_lsm_bprm_creds_for_exec+0x5/0x10 security_bprm_creds_for_exec+0x29/0x40 bprm_execve+0x1c1/0x900 do_execveat_common.isra.0+0x1af/0x260 __x64_sys_execve+0x32/0x40 It's problem for all stable trees with bpf_d_path helper, which was added in 5.9. This issue is fixed in current bpf code, where we identify and mark trusted pointers, so the above code would fail even to load. For the sake of the stable trees and to workaround potentially broken verifier in the future, adding the code that reads the path object from the passed pointer and verifies it's valid in kernel space. Fixes: 6e22ab9da793 ("bpf: Add d_path helper") Reported-by: Anastasios Papagiannis <[email protected]> Suggested-by: Alexei Starovoitov <[email protected]> Signed-off-by: Jiri Olsa <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Stanislav Fomichev <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-06-06bpf: Factor out a common helper free_all()Hou Tao1-15/+16
Factor out a common helper free_all() to free all normal elements or per-cpu elements on a lock-less list. Signed-off-by: Hou Tao <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-06-06tracing/probes: Add BTF retval type supportMasami Hiramatsu (Google)2-7/+63
Check the target function has non-void retval type and set the correct fetch type if user doesn't specify it. If the function returns void, $retval is rejected as below; # echo 'f unregister_kprobes%return $retval' >> dynamic_events sh: write error: No such file or directory # cat error_log [ 37.488397] trace_fprobe: error: This function returns 'void' type Command: f unregister_kprobes%return $retval ^ Link: https://lore.kernel.org/all/168507476195.913472.16290308831790216609.stgit@mhiramat.roam.corp.google.com/ Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
2023-06-06tracing/probes: Add $arg* meta argument for all function argsMasami Hiramatsu (Google)4-12/+212
Add the '$arg*' meta fetch argument for function-entry probe events. This will be expanded to the all arguments of the function and the tracepoint using BTF function argument information. e.g. # echo 'p vfs_read $arg*' >> dynamic_events # echo 'f vfs_write $arg*' >> dynamic_events # echo 't sched_overutilized_tp $arg*' >> dynamic_events # cat dynamic_events p:kprobes/p_vfs_read_0 vfs_read file=file buf=buf count=count pos=pos f:fprobes/vfs_write__entry vfs_write file=file buf=buf count=count pos=pos t:tracepoints/sched_overutilized_tp sched_overutilized_tp rd=rd overutilized=overutilized Also, single '$arg[0-9]*' will be converted to the BTF function argument. NOTE: This seems like a wildcard, but a fake one at this moment. This is just for telling user that this can be expanded to several arguments. And it is not like other $-vars, you can not use this $arg* as a part of fetch args, e.g. specifying name "foo=$arg*" and using it in dereferences "+0($arg*)" will lead a parse error. Link: https://lore.kernel.org/all/168507475126.913472.18329684401466211816.stgit@mhiramat.roam.corp.google.com/ Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
2023-06-06tracing/probes: Support function parameters if BTF is availableMasami Hiramatsu (Google)6-31/+270
Support function or tracepoint parameters by name if BTF support is enabled and the event is for function entry (this feature can be used with kprobe- events, fprobe-events and tracepoint probe events.) Note that the BTF variable syntax does not require a prefix. If it starts with an alphabetic character or an underscore ('_') without a prefix like '$' and '%', it is considered as a BTF variable. If you specify only the BTF variable name, the argument name will also be the same name instead of 'arg*'. # echo 'p vfs_read count pos' >> dynamic_events # echo 'f vfs_write count pos' >> dynamic_events # echo 't sched_overutilized_tp rd overutilized' >> dynamic_events # cat dynamic_events p:kprobes/p_vfs_read_0 vfs_read count=count pos=pos f:fprobes/vfs_write__entry vfs_write count=count pos=pos t:tracepoints/sched_overutilized_tp sched_overutilized_tp rd=rd overutilized=overutilized Link: https://lore.kernel.org/all/168507474014.913472.16963996883278039183.stgit@mhiramat.roam.corp.google.com/ Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Reviewed-by: Alan Maguire <[email protected]> Tested-by: Alan Maguire <[email protected]>
2023-06-06tracing/probes: Move event parameter fetching code to common parserMasami Hiramatsu (Google)6-132/+155
Move trace event parameter fetching code to common parser in trace_probe.c. This simplifies eprobe's trace-event variable fetching code by introducing a parse context data structure. Link: https://lore.kernel.org/all/168507472950.913472.2812253181558471278.stgit@mhiramat.roam.corp.google.com/ Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
2023-06-06tracing/probes: Add tracepoint support on fprobe_eventsMasami Hiramatsu (Google)5-16/+151
Allow fprobe_events to trace raw tracepoints so that user can trace tracepoints which don't have traceevent wrappers. This new event is always available if the fprobe_events is enabled (thus no kconfig), because the fprobe_events depends on the trace-event and traceporint. e.g. # echo 't sched_overutilized_tp' >> dynamic_events # echo 't 9p_client_req' >> dynamic_events # cat dynamic_events t:tracepoints/sched_overutilized_tp sched_overutilized_tp t:tracepoints/_9p_client_req 9p_client_req The event name is based on the tracepoint name, but if it is started with digit character, an underscore '_' will be added. NOTE: to avoid further confusion, this renames TPARG_FL_TPOINT to TPARG_FL_TEVENT because this flag is used for eprobe (trace-event probe). And reuse TPARG_FL_TPOINT for this raw tracepoint probe. Link: https://lore.kernel.org/all/168507471874.913472.17214624519622959593.stgit@mhiramat.roam.corp.google.com/ Reported-by: kernel test robot <[email protected]> Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/ Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
2023-06-06tracing/probes: Add fprobe events for tracing function entry and exit.Masami Hiramatsu (Google)9-7/+1100
Add fprobe events for tracing function entry and exit instead of kprobe events. With this change, we can continue to trace function entry/exit even if the CONFIG_KPROBES_ON_FTRACE is not available. Since CONFIG_KPROBES_ON_FTRACE requires the CONFIG_DYNAMIC_FTRACE_WITH_REGS, it is not available if the architecture only supports CONFIG_DYNAMIC_FTRACE_WITH_ARGS. And that means kprobe events can not probe function entry/exit effectively on such architecture. But this can be solved if the dynamic events supports fprobe events. The fprobe event is a new dynamic events which is only for the function (symbol) entry and exit. This event accepts non register fetch arguments so that user can trace the function arguments and return values. The fprobe events syntax is here; f[:[GRP/][EVENT]] FUNCTION [FETCHARGS] f[MAXACTIVE][:[GRP/][EVENT]] FUNCTION%return [FETCHARGS] E.g. # echo 'f vfs_read $arg1' >> dynamic_events # echo 'f vfs_read%return $retval' >> dynamic_events # cat dynamic_events f:fprobes/vfs_read__entry vfs_read arg1=$arg1 f:fprobes/vfs_read__exit vfs_read%return arg1=$retval # echo 1 > events/fprobes/enable # head -n 20 trace | tail # TASK-PID CPU# ||||| TIMESTAMP FUNCTION # | | | ||||| | | sh-142 [005] ...1. 448.386420: vfs_read__entry: (vfs_read+0x4/0x340) arg1=0xffff888007f7c540 sh-142 [005] ..... 448.386436: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=0x1 sh-142 [005] ...1. 448.386451: vfs_read__entry: (vfs_read+0x4/0x340) arg1=0xffff888007f7c540 sh-142 [005] ..... 448.386458: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=0x1 sh-142 [005] ...1. 448.386469: vfs_read__entry: (vfs_read+0x4/0x340) arg1=0xffff888007f7c540 sh-142 [005] ..... 448.386476: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=0x1 sh-142 [005] ...1. 448.602073: vfs_read__entry: (vfs_read+0x4/0x340) arg1=0xffff888007f7c540 sh-142 [005] ..... 448.602089: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=0x1 Link: https://lore.kernel.org/all/168507469754.913472.6112857614708350210.stgit@mhiramat.roam.corp.google.com/ Reported-by: kernel test robot <[email protected]> Link: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
2023-06-06tracing/probes: Avoid setting TPARG_FL_FENTRY and TPARG_FL_RETURNMasami Hiramatsu (Google)2-1/+6
When parsing a kprobe event, the return probe always sets both TPARG_FL_RETURN and TPARG_FL_FENTRY, but this is not useful because some fetchargs are only for return probe and some others only for function entry. Make it obviously mutual exclusive. Link: https://lore.kernel.org/all/168507468731.913472.11354553441385410734.stgit@mhiramat.roam.corp.google.com/ Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
2023-06-06fprobe: Pass return address to the handlersMasami Hiramatsu (Google)4-6/+10
Pass return address as 'ret_ip' to the fprobe entry and return handlers so that the fprobe user handler can get the reutrn address without analyzing arch-dependent pt_regs. Link: https://lore.kernel.org/all/168507467664.913472.11642316698862778600.stgit@mhiramat.roam.corp.google.com/ Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
2023-06-06watch_queue: prevent dangling pipe pointerSiddh Raman Pant1-6/+6
NULL the dangling pipe reference while clearing watch_queue. If not done, a reference to a freed pipe remains in the watch_queue, as this function is called before freeing a pipe in free_pipe_info() (see line 834 of fs/pipe.c). The sole use of wqueue->defunct is for checking if the watch queue has been cleared, but wqueue->pipe is also NULLed while clearing. Thus, wqueue->defunct is superfluous, as wqueue->pipe can be checked for NULL. Hence, the former can be removed. Tested with keyutils testsuite. Cc: [email protected] # 6.1 Signed-off-by: Siddh Raman Pant <[email protected]> Acked-by: David Howells <[email protected]> Message-Id: <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-06-06perf: Re-instate the linear PMU searchPeter Zijlstra1-13/+24
Full revert of commit 9551fbb64d09 ("perf/core: Remove pmu linear searching code"). Some architectures (notably arm/arm64) still relied on the linear search in order to find the PMU that consumes PERF_TYPE_{HARDWARE,HW_CACHE,RAW}. This will need a more thorought audit and cleanup. Reported-by: Nathan Chancellor <[email protected]> Reported-by: Krzysztof Kozlowski <[email protected]> Acked-by: Mark Rutland <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2023-06-05bpf: netfilter: Add BPF_NETFILTER bpf_attach_typeFlorian Westphal1-0/+9
Andrii Nakryiko writes: And we currently don't have an attach type for NETLINK BPF link. Thankfully it's not too late to add it. I see that link_create() in kernel/bpf/syscall.c just bypasses attach_type check. We shouldn't have done that. Instead we need to add BPF_NETLINK attach type to enum bpf_attach_type. And wire all that properly throughout the kernel and libbpf itself. This adds BPF_NETFILTER and uses it. This breaks uabi but this wasn't in any non-rc release yet, so it should be fine. v2: check link_attack prog type in link_create too Fixes: 84601d6ee68a ("bpf: add bpf_link support for BPF_NETFILTER programs") Suggested-by: Andrii Nakryiko <[email protected]> Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/CAEf4BzZ69YgrQW7DHCJUT_X+GqMq_ZQQPBwopaJJVGFD5=d5Vg@mail.gmail.com/ Link: https://lore.kernel.org/bpf/[email protected]
2023-06-05bpf: Teach verifier that trusted PTR_TO_BTF_ID pointers are non-NULLDavid Vernet1-2/+7
In reg_type_not_null(), we currently assume that a pointer may be NULL if it has the PTR_MAYBE_NULL modifier, or if it doesn't belong to one of several base type of pointers that are never NULL-able. For example, PTR_TO_CTX, PTR_TO_MAP_VALUE, etc. It turns out that in some cases, PTR_TO_BTF_ID can never be NULL as well, though we currently don't specify it. For example, if you had the following program: SEC("tc") long example_refcnt_fail(void *ctx) { struct bpf_cpumask *mask1, *mask2; mask1 = bpf_cpumask_create(); mask2 = bpf_cpumask_create(); if (!mask1 || !mask2) goto error_release; bpf_cpumask_test_cpu(0, (const struct cpumask *)mask1); bpf_cpumask_test_cpu(0, (const struct cpumask *)mask2); error_release: if (mask1) bpf_cpumask_release(mask1); if (mask2) bpf_cpumask_release(mask2); return ret; } The verifier will incorrectly fail to load the program, thinking (unintuitively) that we have a possibly-unreleased reference if the mask is NULL, because we (correctly) don't issue a bpf_cpumask_release() on the NULL path. The reason the verifier gets confused is due to the fact that we don't explicitly tell the verifier that trusted PTR_TO_BTF_ID pointers can never be NULL. Basically, if we successfully get past the if check (meaning both pointers go from ptr_or_null_bpf_cpumask to ptr_bpf_cpumask), the verifier will correctly assume that the references need to be dropped on any possible branch that leads to program exit. However, it will _incorrectly_ think that the ptr == NULL branch is possible, and will erroneously detect it as a branch on which we failed to drop the reference. The solution is of course to teach the verifier that trusted PTR_TO_BTF_ID pointers can never be NULL, so that it doesn't incorrectly think it's possible for the reference to be present on the ptr == NULL branch. A follow-on patch will add a selftest that verifies this behavior. Signed-off-by: David Vernet <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>