aboutsummaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)AuthorFilesLines
2021-11-06mm/page_alloc: add __alloc_size attributes for better bounds checkingKees Cook1-2/+2
As already done in GrapheneOS, add the __alloc_size attribute for appropriate page allocator interfaces, to provide additional hinting for better bounds checking, assisting CONFIG_FORTIFY_SOURCE and other compiler optimizations. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]> Co-developed-by: Daniel Micay <[email protected]> Signed-off-by: Daniel Micay <[email protected]> Cc: Andy Whitcroft <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Dwaipayan Ray <[email protected]> Cc: Joe Perches <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Lukas Bulwahn <[email protected]> Cc: Miguel Ojeda <[email protected]> Cc: Nathan Chancellor <[email protected]> Cc: Nick Desaulniers <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Alexandre Bounine <[email protected]> Cc: Gustavo A. R. Silva <[email protected]> Cc: Ira Weiny <[email protected]> Cc: Jing Xiangfeng <[email protected]> Cc: John Hubbard <[email protected]> Cc: kernel test robot <[email protected]> Cc: Matt Porter <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Souptick Joarder <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-06mm/vmalloc: add __alloc_size attributes for better bounds checkingKees Cook1-11/+11
As already done in GrapheneOS, add the __alloc_size attribute for appropriate vmalloc allocator interfaces, to provide additional hinting for better bounds checking, assisting CONFIG_FORTIFY_SOURCE and other compiler optimizations. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]> Co-developed-by: Daniel Micay <[email protected]> Signed-off-by: Daniel Micay <[email protected]> Cc: Andy Whitcroft <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Dwaipayan Ray <[email protected]> Cc: Joe Perches <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Lukas Bulwahn <[email protected]> Cc: Miguel Ojeda <[email protected]> Cc: Nathan Chancellor <[email protected]> Cc: Nick Desaulniers <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Alexandre Bounine <[email protected]> Cc: Gustavo A. R. Silva <[email protected]> Cc: Ira Weiny <[email protected]> Cc: Jing Xiangfeng <[email protected]> Cc: John Hubbard <[email protected]> Cc: kernel test robot <[email protected]> Cc: Matt Porter <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Souptick Joarder <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-06mm/kvmalloc: add __alloc_size attributes for better bounds checkingKees Cook1-8/+8
As already done in GrapheneOS, add the __alloc_size attribute for regular kvmalloc interfaces, to provide additional hinting for better bounds checking, assisting CONFIG_FORTIFY_SOURCE and other compiler optimizations. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]> Co-developed-by: Daniel Micay <[email protected]> Signed-off-by: Daniel Micay <[email protected]> Reviewed-by: Nick Desaulniers <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andy Whitcroft <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Dwaipayan Ray <[email protected]> Cc: Joe Perches <[email protected]> Cc: Lukas Bulwahn <[email protected]> Cc: Miguel Ojeda <[email protected]> Cc: Nathan Chancellor <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Alexandre Bounine <[email protected]> Cc: Gustavo A. R. Silva <[email protected]> Cc: Ira Weiny <[email protected]> Cc: Jing Xiangfeng <[email protected]> Cc: John Hubbard <[email protected]> Cc: kernel test robot <[email protected]> Cc: Matt Porter <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Souptick Joarder <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-06slab: add __alloc_size attributes for better bounds checkingKees Cook1-28/+33
As already done in GrapheneOS, add the __alloc_size attribute for regular kmalloc interfaces, to provide additional hinting for better bounds checking, assisting CONFIG_FORTIFY_SOURCE and other compiler optimizations. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]> Co-developed-by: Daniel Micay <[email protected]> Signed-off-by: Daniel Micay <[email protected]> Reviewed-by: Nick Desaulniers <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andy Whitcroft <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Dwaipayan Ray <[email protected]> Cc: Joe Perches <[email protected]> Cc: Lukas Bulwahn <[email protected]> Cc: Miguel Ojeda <[email protected]> Cc: Nathan Chancellor <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Alexandre Bounine <[email protected]> Cc: Gustavo A. R. Silva <[email protected]> Cc: Ira Weiny <[email protected]> Cc: Jing Xiangfeng <[email protected]> Cc: John Hubbard <[email protected]> Cc: kernel test robot <[email protected]> Cc: Matt Porter <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Souptick Joarder <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-06slab: clean up function prototypesKees Cook1-34/+34
Based on feedback from Joe Perches and Linus Torvalds, regularize the slab function prototypes before making attribute changes. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Alexandre Bounine <[email protected]> Cc: Andy Whitcroft <[email protected]> Cc: Daniel Micay <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Dwaipayan Ray <[email protected]> Cc: Gustavo A. R. Silva <[email protected]> Cc: Ira Weiny <[email protected]> Cc: Jing Xiangfeng <[email protected]> Cc: Joe Perches <[email protected]> Cc: John Hubbard <[email protected]> Cc: kernel test robot <[email protected]> Cc: Lukas Bulwahn <[email protected]> Cc: Matt Porter <[email protected]> Cc: Miguel Ojeda <[email protected]> Cc: Nathan Chancellor <[email protected]> Cc: Nick Desaulniers <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Souptick Joarder <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-06Compiler Attributes: add __alloc_size() for better bounds checkingKees Cook3-0/+30
GCC and Clang can use the "alloc_size" attribute to better inform the results of __builtin_object_size() (for compile-time constant values). Clang can additionally use alloc_size to inform the results of __builtin_dynamic_object_size() (for run-time values). Because GCC sees the frequent use of struct_size() as an allocator size argument, and notices it can return SIZE_MAX (the overflow indication), it complains about these call sites overflowing (since SIZE_MAX is greater than the default -Walloc-size-larger-than=PTRDIFF_MAX). This isn't helpful since we already know a SIZE_MAX will be caught at run-time (this was an intentional design). To deal with this, we must disable this check as it is both a false positive and redundant. (Clang does not have this warning option.) Unfortunately, just checking the -Wno-alloc-size-larger-than is not sufficient to make the __alloc_size attribute behave correctly under older GCC versions. The attribute itself must be disabled in those situations too, as there appears to be no way to reliably silence the SIZE_MAX constant expression cases for GCC versions less than 9.1: In file included from ./include/linux/resource_ext.h:11, from ./include/linux/pci.h:40, from drivers/net/ethernet/intel/ixgbe/ixgbe.h:9, from drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c:4: In function 'kmalloc_node', inlined from 'ixgbe_alloc_q_vector' at ./include/linux/slab.h:743:9: ./include/linux/slab.h:618:9: error: argument 1 value '18446744073709551615' exceeds maximum object size 9223372036854775807 [-Werror=alloc-size-larger-than=] return __kmalloc_node(size, flags, node); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ./include/linux/slab.h: In function 'ixgbe_alloc_q_vector': ./include/linux/slab.h:455:7: note: in a call to allocation function '__kmalloc_node' declared here void *__kmalloc_node(size_t size, gfp_t flags, int node) __assume_slab_alignment __malloc; ^~~~~~~~~~~~~~ Specifically: '-Wno-alloc-size-larger-than' is not correctly handled by GCC < 9.1 https://godbolt.org/z/hqsfG7q84 (doesn't disable) https://godbolt.org/z/P9jdrPTYh (doesn't admit to not knowing about option) https://godbolt.org/z/465TPMWKb (only warns when other warnings appear) '-Walloc-size-larger-than=18446744073709551615' is not handled by GCC < 8.2 https://godbolt.org/z/73hh1EPxz (ignores numeric value) Since anything marked with __alloc_size would also qualify for marking with __malloc, just include __malloc along with it to avoid redundant markings. (Suggested by Linus Torvalds.) Finally, make sure checkpatch.pl doesn't get confused about finding the __alloc_size attribute on functions. (Thanks to Joe Perches.) Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]> Tested-by: Randy Dunlap <[email protected]> Cc: Andy Whitcroft <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Daniel Micay <[email protected]> Cc: David Rientjes <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Dwaipayan Ray <[email protected]> Cc: Joe Perches <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Lukas Bulwahn <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Alexandre Bounine <[email protected]> Cc: Gustavo A. R. Silva <[email protected]> Cc: Ira Weiny <[email protected]> Cc: Jing Xiangfeng <[email protected]> Cc: John Hubbard <[email protected]> Cc: kernel test robot <[email protected]> Cc: Matt Porter <[email protected]> Cc: Miguel Ojeda <[email protected]> Cc: Nathan Chancellor <[email protected]> Cc: Nick Desaulniers <[email protected]> Cc: Souptick Joarder <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-06kasan: generic: introduce kasan_record_aux_stack_noalloc()Marco Elver1-0/+2
Introduce a variant of kasan_record_aux_stack() that does not do any memory allocation through stackdepot. This will permit using it in contexts that cannot allocate any memory. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Marco Elver <[email protected]> Tested-by: Shuah Khan <[email protected]> Acked-by: Sebastian Andrzej Siewior <[email protected]> Reviewed-by: Andrey Konovalov <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: "Gustavo A. R. Silva" <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Taras Madan <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vijayanand Jitta <[email protected]> Cc: Vinayak Menon <[email protected]> Cc: Walter Wu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-06lib/stackdepot: introduce __stack_depot_save()Marco Elver1-0/+4
Add __stack_depot_save(), which provides more fine-grained control over stackdepot's memory allocation behaviour, in case stackdepot runs out of "stack slabs". Normally stackdepot uses alloc_pages() in case it runs out of space; passing can_alloc==false to __stack_depot_save() prohibits this, at the cost of more likely failure to record a stack trace. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Marco Elver <[email protected]> Tested-by: Shuah Khan <[email protected]> Acked-by: Sebastian Andrzej Siewior <[email protected]> Reviewed-by: Andrey Konovalov <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: "Gustavo A. R. Silva" <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Taras Madan <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vijayanand Jitta <[email protected]> Cc: Vinayak Menon <[email protected]> Cc: Walter Wu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-06lib/stackdepot: include gfp.hMarco Elver1-0/+2
Patch series "stackdepot, kasan, workqueue: Avoid expanding stackdepot slabs when holding raw_spin_lock", v2. Shuah Khan reported [1]: | When CONFIG_PROVE_RAW_LOCK_NESTING=y and CONFIG_KASAN are enabled, | kasan_record_aux_stack() runs into "BUG: Invalid wait context" when | it tries to allocate memory attempting to acquire spinlock in page | allocation code while holding workqueue pool raw_spinlock. | | There are several instances of this problem when block layer tries | to __queue_work(). Call trace from one of these instances is below: | | kblockd_mod_delayed_work_on() | mod_delayed_work_on() | __queue_delayed_work() | __queue_work() (rcu_read_lock, raw_spin_lock pool->lock held) | insert_work() | kasan_record_aux_stack() | kasan_save_stack() | stack_depot_save() | alloc_pages() | __alloc_pages() | get_page_from_freelist() | rm_queue() | rm_queue_pcplist() | local_lock_irqsave(&pagesets.lock, flags); | [ BUG: Invalid wait context triggered ] PROVE_RAW_LOCK_NESTING is pointing out that (on RT kernels) the locking rules are being violated. More generally, memory is being allocated from a non-preemptive context (raw_spin_lock'd c-s) where it is not allowed. To properly fix this, we must prevent stackdepot from replenishing its "stack slab" pool if memory allocations cannot be done in the current context: it's a bug to use either GFP_ATOMIC nor GFP_NOWAIT in certain non-preemptive contexts, including raw_spin_locks (see gfp.h and commit ab00db216c9c7). The only downside is that saving a stack trace may fail if: stackdepot runs out of space AND the same stack trace has not been recorded before. I expect this to be unlikely, and a simple experiment (boot the kernel) didn't result in any failure to record stack trace from insert_work(). The series includes a few minor fixes to stackdepot that I noticed in preparing the series. It then introduces __stack_depot_save(), which exposes the option to force stackdepot to not allocate any memory. Finally, KASAN is changed to use the new stackdepot interface and provide kasan_record_aux_stack_noalloc(), which is then used by workqueue code. [1] https://lkml.kernel.org/r/[email protected] This patch (of 6): <linux/stackdepot.h> refers to gfp_t, but doesn't include gfp.h. Fix it by including <linux/gfp.h>. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Marco Elver <[email protected]> Tested-by: Shuah Khan <[email protected]> Acked-by: Sebastian Andrzej Siewior <[email protected]> Reviewed-by: Andrey Konovalov <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Walter Wu <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Vijayanand Jitta <[email protected]> Cc: Vinayak Menon <[email protected]> Cc: "Gustavo A. R. Silva" <[email protected]> Cc: Taras Madan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-06mm: don't include <linux/dax.h> in <linux/mempolicy.h>Christoph Hellwig1-1/+0
Not required at all, and having this causes a huge kernel rebuild as soon as something in dax.h changes. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Reviewed-by: Dan Williams <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-06mm, slub: change percpu partial accounting from objects to pagesVlastimil Babka2-13/+2
With CONFIG_SLUB_CPU_PARTIAL enabled, SLUB keeps a percpu list of partial slabs that can be promoted to cpu slab when the previous one is depleted, without accessing the shared partial list. A slab can be added to this list by 1) refill of an empty list from get_partial_node() - once we really have to access the shared partial list, we acquire multiple slabs to amortize the cost of locking, and 2) first free to a previously full slab - instead of putting the slab on a shared partial list, we can more cheaply freeze it and put it on the per-cpu list. To control how large a percpu partial list can grow for a kmem cache, set_cpu_partial() calculates a target number of free objects on each cpu's percpu partial list, and this can be also set by the sysfs file cpu_partial. However, the tracking of actual number of objects is imprecise, in order to limit overhead from cpu X freeing an objects to a slab on percpu partial list of cpu Y. Basically, the percpu partial slabs form a single linked list, and when we add a new slab to the list with current head "oldpage", we set in the struct page of the slab we're adding: page->pages = oldpage->pages + 1; // this is precise page->pobjects = oldpage->pobjects + (page->objects - page->inuse); page->next = oldpage; Thus the real number of free objects in the slab (objects - inuse) is only determined at the moment of adding the slab to the percpu partial list, and further freeing doesn't update the pobjects counter nor propagate it to the current list head. As Jann reports [1], this can easily lead to large inaccuracies, where the target number of objects (up to 30 by default) can translate to the same number of (empty) slab pages on the list. In case 2) above, we put a slab with 1 free object on the list, thus only increase page->pobjects by 1, even if there are subsequent frees on the same slab. Jann has noticed this in practice and so did we [2] when investigating significant increase of kmemcg usage after switching from SLAB to SLUB. While this is no longer a problem in kmemcg context thanks to the accounting rewrite in 5.9, the memory waste is still not ideal and it's questionable whether it makes sense to perform free object count based control when object counts can easily become so much inaccurate. So this patch converts the accounting to be based on number of pages only (which is precise) and removes the page->pobjects field completely. This is also ultimately simpler. To retain the existing set_cpu_partial() heuristic, first calculate the target number of objects as previously, but then convert it to target number of pages by assuming the pages will be half-filled on average. This assumption might obviously also be inaccurate in practice, but cannot degrade to actual number of pages being equal to the target number of objects. We could also skip the intermediate step with target number of objects and rewrite the heuristic in terms of pages. However we still have the sysfs file cpu_partial which uses number of objects and could break existing users if it suddenly becomes number of pages, so this patch doesn't do that. In practice, after this patch the heuristics limit the size of percpu partial list up to 2 pages. In case of a reported regression (which would mean some workload has benefited from the previous imprecise object based counting), we can tune the heuristics to get a better compromise within the new scheme, while still avoid the unexpectedly long percpu partial lists. [1] https://lore.kernel.org/linux-mm/CAG48ez2Qx5K1Cab-m8BdSibp6wLTip6ro4=-umR7BLsEgjEYzA@mail.gmail.com/ [2] https://lore.kernel.org/all/[email protected]/ ========== Evaluation ========== Mel was kind enough to run v1 through mmtests machinery for netperf (localhost) and hackbench and, for most significant results see below. So there are some apparent regressions, especially with hackbench, which I think ultimately boils down to having shorter percpu partial lists on average and some benchmarks benefiting from longer ones. Monitoring slab usage also indicated less memory usage by slab. Based on that, the following patch will bump the defaults to allow longer percpu partial lists than after this patch. However the goal is certainly not such that we would limit the percpu partial lists to 30 pages just because previously a specific alloc/free pattern could lead to the limit of 30 objects translate to a limit to 30 pages - that would make little sense. This is a correctness patch, and if a workload benefits from larger lists, the sysfs tuning knobs are still there to allow that. Netperf 2-socket Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz (20 cores, 40 threads per socket), 384GB RAM TCP-RR: hmean before 127045.79 after 121092.94 (-4.69%, worse) stddev before 2634.37 after 1254.08 UDP-RR: hmean before 166985.45 after 160668.94 ( -3.78%, worse) stddev before 4059.69 after 1943.63 2-socket Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz (20 cores, 40 threads per socket), 512GB RAM TCP-RR: hmean before 84173.25 after 76914.72 ( -8.62%, worse) UDP-RR: hmean before 93571.12 after 96428.69 ( 3.05%, better) stddev before 23118.54 after 16828.14 2-socket Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz (12 cores, 24 threads per socket), 64GB RAM TCP-RR: hmean before 49984.92 after 48922.27 ( -2.13%, worse) stddev before 6248.15 after 4740.51 UDP-RR: hmean before 61854.31 after 68761.81 ( 11.17%, better) stddev before 4093.54 after 5898.91 other machines - within 2% Hackbench (results before and after the patch, negative % means worse) 2-socket AMD EPYC 7713 (64 cores, 128 threads per core), 256GB RAM hackbench-process-sockets Amean 1 0.5380 0.5583 ( -3.78%) Amean 4 0.7510 0.8150 ( -8.52%) Amean 7 0.7930 0.9533 ( -20.22%) Amean 12 0.7853 1.1313 ( -44.06%) Amean 21 1.1520 1.4993 ( -30.15%) Amean 30 1.6223 1.9237 ( -18.57%) Amean 48 2.6767 2.9903 ( -11.72%) Amean 79 4.0257 5.1150 ( -27.06%) Amean 110 5.5193 7.4720 ( -35.38%) Amean 141 7.2207 9.9840 ( -38.27%) Amean 172 8.4770 12.1963 ( -43.88%) Amean 203 9.6473 14.3137 ( -48.37%) Amean 234 11.3960 18.7917 ( -64.90%) Amean 265 13.9627 22.4607 ( -60.86%) Amean 296 14.9163 26.0483 ( -74.63%) hackbench-thread-sockets Amean 1 0.5597 0.5877 ( -5.00%) Amean 4 0.7913 0.8960 ( -13.23%) Amean 7 0.8190 1.0017 ( -22.30%) Amean 12 0.9560 1.1727 ( -22.66%) Amean 21 1.7587 1.5660 ( 10.96%) Amean 30 2.4477 1.9807 ( 19.08%) Amean 48 3.4573 3.0630 ( 11.41%) Amean 79 4.7903 5.1733 ( -8.00%) Amean 110 6.1370 7.4220 ( -20.94%) Amean 141 7.5777 9.2617 ( -22.22%) Amean 172 9.2280 11.0907 ( -20.18%) Amean 203 10.2793 13.3470 ( -29.84%) Amean 234 11.2410 17.1070 ( -52.18%) Amean 265 12.5970 23.3323 ( -85.22%) Amean 296 17.1540 24.2857 ( -41.57%) 2-socket Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz (20 cores, 40 threads per socket), 384GB RAM hackbench-process-sockets Amean 1 0.5760 0.4793 ( 16.78%) Amean 4 0.9430 0.9707 ( -2.93%) Amean 7 1.5517 1.8843 ( -21.44%) Amean 12 2.4903 2.7267 ( -9.49%) Amean 21 3.9560 4.2877 ( -8.38%) Amean 30 5.4613 5.8343 ( -6.83%) Amean 48 8.5337 9.2937 ( -8.91%) Amean 79 14.0670 15.2630 ( -8.50%) Amean 110 19.2253 21.2467 ( -10.51%) Amean 141 23.7557 25.8550 ( -8.84%) Amean 172 28.4407 29.7603 ( -4.64%) Amean 203 33.3407 33.9927 ( -1.96%) Amean 234 38.3633 39.1150 ( -1.96%) Amean 265 43.4420 43.8470 ( -0.93%) Amean 296 48.3680 48.9300 ( -1.16%) hackbench-thread-sockets Amean 1 0.6080 0.6493 ( -6.80%) Amean 4 1.0000 1.0513 ( -5.13%) Amean 7 1.6607 2.0260 ( -22.00%) Amean 12 2.7637 2.9273 ( -5.92%) Amean 21 5.0613 4.5153 ( 10.79%) Amean 30 6.3340 6.1140 ( 3.47%) Amean 48 9.0567 9.5577 ( -5.53%) Amean 79 14.5657 15.7983 ( -8.46%) Amean 110 19.6213 21.6333 ( -10.25%) Amean 141 24.1563 26.2697 ( -8.75%) Amean 172 28.9687 30.2187 ( -4.32%) Amean 203 33.9763 34.6970 ( -2.12%) Amean 234 38.8647 39.3207 ( -1.17%) Amean 265 44.0813 44.1507 ( -0.16%) Amean 296 49.2040 49.4330 ( -0.47%) 2-socket Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz (20 cores, 40 threads per socket), 512GB RAM hackbench-process-sockets Amean 1 0.5027 0.5017 ( 0.20%) Amean 4 1.1053 1.2033 ( -8.87%) Amean 7 1.8760 2.1820 ( -16.31%) Amean 12 2.9053 3.1810 ( -9.49%) Amean 21 4.6777 4.9920 ( -6.72%) Amean 30 6.5180 6.7827 ( -4.06%) Amean 48 10.0710 10.5227 ( -4.48%) Amean 79 16.4250 17.5053 ( -6.58%) Amean 110 22.6203 24.4617 ( -8.14%) Amean 141 28.0967 31.0363 ( -10.46%) Amean 172 34.4030 36.9233 ( -7.33%) Amean 203 40.5933 43.0850 ( -6.14%) Amean 234 46.6477 48.7220 ( -4.45%) Amean 265 53.0530 53.9597 ( -1.71%) Amean 296 59.2760 59.9213 ( -1.09%) hackbench-thread-sockets Amean 1 0.5363 0.5330 ( 0.62%) Amean 4 1.1647 1.2157 ( -4.38%) Amean 7 1.9237 2.2833 ( -18.70%) Amean 12 2.9943 3.3110 ( -10.58%) Amean 21 4.9987 5.1880 ( -3.79%) Amean 30 6.7583 7.0043 ( -3.64%) Amean 48 10.4547 10.8353 ( -3.64%) Amean 79 16.6707 17.6790 ( -6.05%) Amean 110 22.8207 24.4403 ( -7.10%) Amean 141 28.7090 31.0533 ( -8.17%) Amean 172 34.9387 36.8260 ( -5.40%) Amean 203 41.1567 43.0450 ( -4.59%) Amean 234 47.3790 48.5307 ( -2.43%) Amean 265 53.9543 54.6987 ( -1.38%) Amean 296 60.0820 60.2163 ( -0.22%) 1-socket Intel(R) Xeon(R) CPU E3-1240 v5 @ 3.50GHz (4 cores, 8 threads), 32 GB RAM hackbench-process-sockets Amean 1 1.4760 1.5773 ( -6.87%) Amean 3 3.9370 4.0910 ( -3.91%) Amean 5 6.6797 6.9357 ( -3.83%) Amean 7 9.3367 9.7150 ( -4.05%) Amean 12 15.7627 16.1400 ( -2.39%) Amean 18 23.5360 23.6890 ( -0.65%) Amean 24 31.0663 31.3137 ( -0.80%) Amean 30 38.7283 39.0037 ( -0.71%) Amean 32 41.3417 41.6097 ( -0.65%) hackbench-thread-sockets Amean 1 1.5250 1.6043 ( -5.20%) Amean 3 4.0897 4.2603 ( -4.17%) Amean 5 6.7760 7.0933 ( -4.68%) Amean 7 9.4817 9.9157 ( -4.58%) Amean 12 15.9610 16.3937 ( -2.71%) Amean 18 23.9543 24.3417 ( -1.62%) Amean 24 31.4400 31.7217 ( -0.90%) Amean 30 39.2457 39.5467 ( -0.77%) Amean 32 41.8267 42.1230 ( -0.71%) 2-socket Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz (12 cores, 24 threads per socket), 64GB RAM hackbench-process-sockets Amean 1 1.0347 1.0880 ( -5.15%) Amean 4 1.7267 1.8527 ( -7.30%) Amean 7 2.6707 2.8110 ( -5.25%) Amean 12 4.1617 4.3383 ( -4.25%) Amean 21 7.0070 7.2600 ( -3.61%) Amean 30 9.9187 10.2397 ( -3.24%) Amean 48 15.6710 16.3923 ( -4.60%) Amean 79 24.7743 26.1247 ( -5.45%) Amean 110 34.3000 35.9307 ( -4.75%) Amean 141 44.2043 44.8010 ( -1.35%) Amean 172 54.2430 54.7260 ( -0.89%) Amean 192 60.6557 60.9777 ( -0.53%) hackbench-thread-sockets Amean 1 1.0610 1.1353 ( -7.01%) Amean 4 1.7543 1.9140 ( -9.10%) Amean 7 2.7840 2.9573 ( -6.23%) Amean 12 4.3813 4.4937 ( -2.56%) Amean 21 7.3460 7.5350 ( -2.57%) Amean 30 10.2313 10.5190 ( -2.81%) Amean 48 15.9700 16.5940 ( -3.91%) Amean 79 25.3973 26.6637 ( -4.99%) Amean 110 35.1087 36.4797 ( -3.91%) Amean 141 45.8220 46.3053 ( -1.05%) Amean 172 55.4917 55.7320 ( -0.43%) Amean 192 62.7490 62.5410 ( 0.33%) Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Reported-by: Jann Horn <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-06mm: move kvmalloc-related functions to slab.hMatthew Wilcox (Oracle)2-34/+34
Not all files in the kernel should include mm.h. Migrating callers from kmalloc to kvmalloc is easier if the kvmalloc functions are in slab.h. [[email protected]: move the new kvrealloc() also] [[email protected]: drivers/hwmon/occ/p9_sbe.c needs slab.h] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Acked-by: Pekka Enberg <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-10-28mm: filemap: check if THP has hwpoisoned subpage for PMD page faultYang Shi1-0/+23
When handling shmem page fault the THP with corrupted subpage could be PMD mapped if certain conditions are satisfied. But kernel is supposed to send SIGBUS when trying to map hwpoisoned page. There are two paths which may do PMD map: fault around and regular fault. Before commit f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths") the thing was even worse in fault around path. The THP could be PMD mapped as long as the VMA fits regardless what subpage is accessed and corrupted. After this commit as long as head page is not corrupted the THP could be PMD mapped. In the regular fault path the THP could be PMD mapped as long as the corrupted page is not accessed and the VMA fits. This loophole could be fixed by iterating every subpage to check if any of them is hwpoisoned or not, but it is somewhat costly in page fault path. So introduce a new page flag called HasHWPoisoned on the first tail page. It indicates the THP has hwpoisoned subpage(s). It is set if any subpage of THP is found hwpoisoned by memory failure and after the refcount is bumped successfully, then cleared when the THP is freed or split. The soft offline path doesn't need this since soft offline handler just marks a subpage hwpoisoned when the subpage is migrated successfully. But shmem THP didn't get split then migrated at all. Link: https://lkml.kernel.org/r/[email protected] Fixes: 800d8c63b2e9 ("shmem: add huge pages support") Signed-off-by: Yang Shi <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Suggested-by: Kirill A. Shutemov <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Peter Xu <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-10-28Merge tag 'net-5.15-rc8' of ↵Linus Torvalds9-19/+28
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from WiFi (mac80211), and BPF. Current release - regressions: - skb_expand_head: adjust skb->truesize to fix socket memory accounting - mptcp: fix corrupt receiver key in MPC + data + checksum Previous releases - regressions: - multicast: calculate csum of looped-back and forwarded packets - cgroup: fix memory leak caused by missing cgroup_bpf_offline - cfg80211: fix management registrations locking, prevent list corruption - cfg80211: correct false positive in bridge/4addr mode check - tcp_bpf: fix race in the tcp_bpf_send_verdict resulting in reusing previous verdict Previous releases - always broken: - sctp: enhancements for the verification tag, prevent attackers from killing SCTP sessions - tipc: fix size validations for the MSG_CRYPTO type - mac80211: mesh: fix HE operation element length check, prevent out of bound access - tls: fix sign of socket errors, prevent positive error codes being reported from read()/write() - cfg80211: scan: extend RCU protection in cfg80211_add_nontrans_list() - implement ->sock_is_readable() for UDP and AF_UNIX, fix poll() for sockets in a BPF sockmap - bpf: fix potential race in tail call compatibility check resulting in two operations which would make the map incompatible succeeding - bpf: prevent increasing bpf_jit_limit above max - bpf: fix error usage of map_fd and fdget() in generic batch update - phy: ethtool: lock the phy for consistency of results - prevent infinite while loop in skb_tx_hash() when Tx races with driver reconfiguring the queue <> traffic class mapping - usbnet: fixes for bad HW conjured by syzbot - xen: stop tx queues during live migration, prevent UAF - net-sysfs: initialize uid and gid before calling net_ns_get_ownership - mlxsw: prevent Rx stalls under memory pressure" * tag 'net-5.15-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (67 commits) Revert "net: hns3: fix pause config problem after autoneg disabled" mptcp: fix corrupt receiver key in MPC + data + checksum riscv, bpf: Fix potential NULL dereference octeontx2-af: Fix possible null pointer dereference. octeontx2-af: Display all enabled PF VF rsrc_alloc entries. octeontx2-af: Check whether ipolicers exists net: ethernet: microchip: lan743x: Fix skb allocation failure net/tls: Fix flipped sign in async_wait.err assignment net/tls: Fix flipped sign in tls_err_abort() calls net/smc: Correct spelling mistake to TCPF_SYN_RECV net/smc: Fix smc_link->llc_testlink_time overflow nfp: bpf: relax prog rejection for mtu check through max_pkt_offset vmxnet3: do not stop tx queues after netif_device_detach() r8169: Add device 10ec:8162 to driver r8169 ptp: Document the PTP_CLK_MAGIC ioctl number usbnet: fix error return code in usbnet_probe() net: hns3: adjust string spaces of some parameters of tx bd info in debugfs net: hns3: expand buffer len for some debugfs command net: hns3: add more string spaces for dumping packets number of queue info in debugfs net: hns3: fix data endian problem of some functions of debugfs ...
2021-10-28mptcp: fix corrupt receiver key in MPC + data + checksumDavide Caratti1-0/+4
using packetdrill it's possible to observe that the receiver key contains random values when clients transmit MP_CAPABLE with data and checksum (as specified in RFC8684 §3.1). Fix the layout of mptcp_out_options, to avoid using the skb extension copy when writing the MP_CAPABLE sub-option. Fixes: d7b269083786 ("mptcp: shrink mptcp_out_options struct") Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/233 Reported-by: Poorva Sonparote <[email protected]> Signed-off-by: Davide Caratti <[email protected]> Signed-off-by: Mat Martineau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2021-10-28net/tls: Fix flipped sign in tls_err_abort() callsDaniel Jordan1-7/+2
sk->sk_err appears to expect a positive value, a convention that ktls doesn't always follow and that leads to memory corruption in other code. For instance, [kworker] tls_encrypt_done(..., err=<negative error from crypto request>) tls_err_abort(.., err) sk->sk_err = err; [task] splice_from_pipe_feed ... tls_sw_do_sendpage if (sk->sk_err) { ret = -sk->sk_err; // ret is positive splice_from_pipe_feed (continued) ret = actor(...) // ret is still positive and interpreted as bytes // written, resulting in underflow of buf->len and // sd->len, leading to huge buf->offset and bogus // addresses computed in later calls to actor() Fix all tls_err_abort() callers to pass a negative error code consistently and centralize the error-prone sign flip there, throwing in a warning to catch future misuse and uninlining the function so it really does only warn once. Cc: [email protected] Fixes: c46234ebb4d1e ("tls: RX path for ktls") Reported-by: [email protected] Signed-off-by: Daniel Jordan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-10-27Merge tag 'mac80211-for-net-2021-10-27' of ↵Jakub Kicinski1-2/+0
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Two fixes: * bridge vs. 4-addr mode check was wrong * management frame registrations locking was wrong, causing list corruption/crashes ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2021-10-26Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski6-8/+19
Daniel Borkmann says: ==================== pull-request: bpf 2021-10-26 We've added 12 non-merge commits during the last 7 day(s) which contain a total of 23 files changed, 118 insertions(+), 98 deletions(-). The main changes are: 1) Fix potential race window in BPF tail call compatibility check, from Toke Høiland-Jørgensen. 2) Fix memory leak in cgroup fs due to missing cgroup_bpf_offline(), from Quanyang Wang. 3) Fix file descriptor reference counting in generic_map_update_batch(), from Xu Kuohai. 4) Fix bpf_jit_limit knob to the max supported limit by the arch's JIT, from Lorenz Bauer. 5) Fix BPF sockmap ->poll callbacks for UDP and AF_UNIX sockets, from Cong Wang and Yucong Sun. 6) Fix BPF sockmap concurrency issue in TCP on non-blocking sendmsg calls, from Liu Jian. 7) Fix build failure of INODE_STORAGE and TASK_STORAGE maps on !CONFIG_NET, from Tejun Heo. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Fix potential race in tail call compatibility check bpf: Move BPF_MAP_TYPE for INODE_STORAGE and TASK_STORAGE outside of CONFIG_NET selftests/bpf: Use recv_timeout() instead of retries net: Implement ->sock_is_readable() for UDP and AF_UNIX skmsg: Extract and reuse sk_msg_is_readable() net: Rename ->stream_memory_read to ->sock_is_readable tcp_bpf: Fix one concurrency problem in the tcp_bpf_send_verdict function cgroup: Fix memory leak caused by missing cgroup_bpf_offline bpf: Fix error usage of map_fd and fdget() in generic_map_update_batch() bpf: Prevent increasing bpf_jit_limit above max bpf: Define bpf_jit_alloc_exec_limit for arm64 JIT bpf: Define bpf_jit_alloc_exec_limit for riscv JIT ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2021-10-26bpf: Fix potential race in tail call compatibility checkToke Høiland-Jørgensen1-2/+5
Lorenzo noticed that the code testing for program type compatibility of tail call maps is potentially racy in that two threads could encounter a map with an unset type simultaneously and both return true even though they are inserting incompatible programs. The race window is quite small, but artificially enlarging it by adding a usleep_range() inside the check in bpf_prog_array_compatible() makes it trivial to trigger from userspace with a program that does, essentially: map_fd = bpf_create_map(BPF_MAP_TYPE_PROG_ARRAY, 4, 4, 2, 0); pid = fork(); if (pid) { key = 0; value = xdp_fd; } else { key = 1; value = tc_fd; } err = bpf_map_update_elem(map_fd, &key, &value, 0); While the race window is small, it has potentially serious ramifications in that triggering it would allow a BPF program to tail call to a program of a different type. So let's get rid of it by protecting the update with a spinlock. The commit in the Fixes tag is the last commit that touches the code in question. v2: - Use a spinlock instead of an atomic variable and cmpxchg() (Alexei) v3: - Put lock and the members it protects into an embedded 'owner' struct (Daniel) Fixes: 3324b584b6f6 ("ebpf: misc core cleanup") Reported-by: Lorenzo Bianconi <[email protected]> Signed-off-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-10-26bpf: Move BPF_MAP_TYPE for INODE_STORAGE and TASK_STORAGE outside of CONFIG_NETTejun Heo1-4/+4
bpf_types.h has BPF_MAP_TYPE_INODE_STORAGE and BPF_MAP_TYPE_TASK_STORAGE declared inside #ifdef CONFIG_NET although they are built regardless of CONFIG_NET. So, when CONFIG_BPF_SYSCALL && !CONFIG_NET, they are built without the declarations leading to spurious build failures and not registered to bpf_map_types making them unavailable. Fix it by moving the BPF_MAP_TYPE for the two map types outside of CONFIG_NET. Reported-by: kernel test robot <[email protected]> Fixes: a10787e6d58c ("bpf: Enable task local storage for tracing programs") Signed-off-by: Tejun Heo <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-10-26skmsg: Extract and reuse sk_msg_is_readable()Cong Wang1-0/+1
tcp_bpf_sock_is_readable() is pretty much generic, we can extract it and reuse it for non-TCP sockets. Signed-off-by: Cong Wang <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-10-26net: Rename ->stream_memory_read to ->sock_is_readableCong Wang2-2/+8
The proto ops ->stream_memory_read() is currently only used by TCP to check whether psock queue is empty or not. We need to rename it before reusing it for non-TCP protocols, and adjust the exsiting users accordingly. Signed-off-by: Cong Wang <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-10-26net: multicast: calculate csum of looped-back and forwarded packetsCyril Strejc1-2/+3
During a testing of an user-space application which transmits UDP multicast datagrams and utilizes multicast routing to send the UDP datagrams out of defined network interfaces, I've found a multicast router does not fill-in UDP checksum into locally produced, looped-back and forwarded UDP datagrams, if an original output NIC the datagrams are sent to has UDP TX checksum offload enabled. The datagrams are sent malformed out of the NIC the datagrams have been forwarded to. It is because: 1. If TX checksum offload is enabled on the output NIC, UDP checksum is not calculated by kernel and is not filled into skb data. 2. dev_loopback_xmit(), which is called solely by ip_mc_finish_output(), sets skb->ip_summed = CHECKSUM_UNNECESSARY unconditionally. 3. Since 35fc92a9 ("[NET]: Allow forwarding of ip_summed except CHECKSUM_COMPLETE"), the ip_summed value is preserved during forwarding. 4. If ip_summed != CHECKSUM_PARTIAL, checksum is not calculated during a packet egress. The minimum fix in dev_loopback_xmit(): 1. Preserves skb->ip_summed CHECKSUM_PARTIAL. This is the case when the original output NIC has TX checksum offload enabled. The effects are: a) If the forwarding destination interface supports TX checksum offloading, the NIC driver is responsible to fill-in the checksum. b) If the forwarding destination interface does NOT support TX checksum offloading, checksums are filled-in by kernel before skb is submitted to the NIC driver. c) For local delivery, checksum validation is skipped as in the case of CHECKSUM_UNNECESSARY, thanks to skb_csum_unnecessary(). 2. Translates ip_summed CHECKSUM_NONE to CHECKSUM_UNNECESSARY. It means, for CHECKSUM_NONE, the behavior is unmodified and is there to skip a looped-back packet local delivery checksum validation. Signed-off-by: Cyril Strejc <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-10-25cfg80211: fix management registrations lockingJohannes Berg1-2/+0
The management registrations locking was broken, the list was locked for each wdev, but cfg80211_mgmt_registrations_update() iterated it without holding all the correct spinlocks, causing list corruption. Rather than trying to fix it with fine-grained locking, just move the lock to the wiphy/rdev (still need the list on each wdev), we already need to hold the wdev lock to change it, so there's no contention on the lock in any case. This trivially fixes the bug since we hold one wdev's lock already, and now will hold the lock that protects all lists. Cc: [email protected] Reported-by: Jouni Malinen <[email protected]> Fixes: 6cd536fe62ef ("cfg80211: change internal management frame registration API") Link: https://lore.kernel.org/r/20211025133111.5cf733eab0f4.I7b0abb0494ab712f74e2efcd24bb31ac33f7eee9@changeid Signed-off-by: Johannes Berg <[email protected]>
2021-10-22bpf: Prevent increasing bpf_jit_limit above maxLorenz Bauer1-0/+1
Restrict bpf_jit_limit to the maximum supported by the arch's JIT. Signed-off-by: Lorenz Bauer <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-10-22Merge tag 'acpi-5.15-rc7' of ↵Linus Torvalds1-2/+7
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI fixes from Rafael Wysocki: "These fix two regressions, one related to ACPI power resources management and one that broke ACPI tools compilation. Specifics: - Stop turning off unused ACPI power resources in an unknown state to address a regression introduced during the 5.14 cycle (Rafael Wysocki). - Fix an ACPI tools build issue introduced recently when the minimal stdarg.h was added (Miguel Bernal Marin)" * tag 'acpi-5.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI: PM: Do not turn off power resources in unknown state ACPI: tools: fix compilation error
2021-10-22Merge branch 'acpi-tools'Rafael J. Wysocki1-2/+7
Merge a fix for a recent ACPI tools bild regresson. * acpi-tools: ACPI: tools: fix compilation error
2021-10-21Merge branch 'ucount-fixes-for-v5.15' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull ucounts fixes from Eric Biederman: "There has been one very hard to track down bug in the ucount code that we have been tracking since roughly v5.14 was released. Alex managed to find a reliable reproducer a few days ago and then I was able to instrument the code and figure out what the issue was. It turns out the sigqueue_alloc single atomic operation optimization did not play nicely with ucounts multiple level rlimits. It turned out that either sigqueue_alloc or sigqueue_free could be operating on multiple levels and trigger the conditions for the optimization on more than one level at the same time. To deal with that situation I have introduced inc_rlimit_get_ucounts and dec_rlimit_put_ucounts that just focuses on the optimization and the rlimit and ucount changes. While looking into the big bug I found I couple of other little issues so I am including those fixes here as well. When I have time I would very much like to dig into process ownership of the shared signal queue and see if we could pick a single owner for the entire queue so that all of the rlimits can count to that owner. That should entirely remove the need to call get_ucounts and put_ucounts in sigqueue_alloc and sigqueue_free. It is difficult because Linux unlike POSIX supports setuid that works on a single thread" * 'ucount-fixes-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: ucounts: Move get_ucounts from cred_alloc_blank to key_change_session_keyring ucounts: Proper error handling in set_cred_ucounts ucounts: Pair inc_rlimit_ucounts with dec_rlimit_ucoutns in commit_creds ucounts: Fix signal ucount refcounting
2021-10-21Merge tag 'net-5.15-rc7' of ↵Linus Torvalds5-9/+12
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from netfilter, and can. We'll have one more fix for a socket accounting regression, it's still getting polished. Otherwise things look fine. Current release - regressions: - revert "vrf: reset skb conntrack connection on VRF rcv", there are valid uses for previous behavior - can: m_can: fix iomap_read_fifo() and iomap_write_fifo() Current release - new code bugs: - mlx5: e-switch, return correct error code on group creation failure Previous releases - regressions: - sctp: fix transport encap_port update in sctp_vtag_verify - stmmac: fix E2E delay mechanism (in PTP timestamping) Previous releases - always broken: - netfilter: ip6t_rt: fix out-of-bounds read of ipv6_rt_hdr - netfilter: xt_IDLETIMER: fix out-of-bound read caused by lack of init - netfilter: ipvs: make global sysctl read-only in non-init netns - tcp: md5: fix selection between vrf and non-vrf keys - ipv6: count rx stats on the orig netdev when forwarding - bridge: mcast: use multicast_membership_interval for IGMPv3 - can: - j1939: fix UAF for rx_kref of j1939_priv abort sessions on receiving bad messages - isotp: fix TX buffer concurrent access in isotp_sendmsg() fix return error on FC timeout on TX path - ice: fix re-init of RDMA Tx queues and crash if RDMA was not inited - hns3: schedule the polling again when allocation fails, prevent stalls - drivers: add missing of_node_put() when aborting for_each_available_child_of_node() - ptp: fix possible memory leak and UAF in ptp_clock_register() - e1000e: fix packet loss in burst mode on Tiger Lake and later - mlx5e: ipsec: fix more checksum offload issues" * tag 'net-5.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (75 commits) usbnet: sanity check for maxpacket net: enetc: make sure all traffic classes can send large frames net: enetc: fix ethtool counter name for PM0_TERR ptp: free 'vclock_index' in ptp_clock_release() sfc: Don't use netif_info before net_device setup sfc: Export fibre-specific supported link modes net/mlx5e: IPsec: Fix work queue entry ethernet segment checksum flags net/mlx5e: IPsec: Fix a misuse of the software parser's fields net/mlx5e: Fix vlan data lost during suspend flow net/mlx5: E-switch, Return correct error code on group creation failure net/mlx5: Lag, change multipath and bonding to be mutually exclusive ice: Add missing E810 device ids igc: Update I226_K device ID e1000e: Fix packet loss on Tiger Lake and later e1000e: Separate TGP board type from SPT ptp: Fix possible memory leak in ptp_clock_register() net: stmmac: Fix E2E delay mechanism nfc: st95hf: Make spi remove() callback return zero net: hns3: disable sriov before unload hclge layer net: hns3: fix vf reset workqueue cannot exit ...
2021-10-20net/mlx5: Lag, change multipath and bonding to be mutually exclusiveMaor Dickman1-1/+0
Both multipath and bonding events are changing the HW LAG state independently. Handling one of the features events while the other is already enabled can cause unwanted behavior, for example handling bonding event while multipath enabled will disable the lag and cause multipath to stop working. Fix it by ignoring bonding event while in multipath and ignoring FIB events while in bonding mode. Fixes: 544fe7c2e654 ("net/mlx5e: Activate HW multipath and handle port affinity based on FIB events") Signed-off-by: Maor Dickman <[email protected]> Reviewed-by: Roi Dayan <[email protected]> Reviewed-by: Mark Bloch <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2021-10-20Merge tag 'trace-v5.15-rc5' of ↵Linus Torvalds1-40/+9
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fix from Steven Rostedt: "Recursion fix for tracing. While cleaning up some of the tracing recursion protection logic, I discovered a scenario that the current design would miss, and would allow an infinite recursion. Removing an optimization trick that opened the hole fixes the issue and cleans up the code as well" * tag 'trace-v5.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Have all levels of checks prevent recursion
2021-10-18mm/secretmem: fix NULL page->mapping dereference in page_is_secretmem()Sean Christopherson1-1/+1
Check for a NULL page->mapping before dereferencing the mapping in page_is_secretmem(), as the page's mapping can be nullified while gup() is running, e.g. by reclaim or truncation. BUG: kernel NULL pointer dereference, address: 0000000000000068 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 6 PID: 4173897 Comm: CPU 3/KVM Tainted: G W RIP: 0010:internal_get_user_pages_fast+0x621/0x9d0 Code: <48> 81 7a 68 80 08 04 bc 0f 85 21 ff ff 8 89 c7 be RSP: 0018:ffffaa90087679b0 EFLAGS: 00010046 RAX: ffffe3f37905b900 RBX: 00007f2dd561e000 RCX: ffffe3f37905b934 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffe3f37905b900 ... CR2: 0000000000000068 CR3: 00000004c5898003 CR4: 00000000001726e0 Call Trace: get_user_pages_fast_only+0x13/0x20 hva_to_pfn+0xa9/0x3e0 try_async_pf+0xa1/0x270 direct_page_fault+0x113/0xad0 kvm_mmu_page_fault+0x69/0x680 vmx_handle_exit+0xe1/0x5d0 kvm_arch_vcpu_ioctl_run+0xd81/0x1c70 kvm_vcpu_ioctl+0x267/0x670 __x64_sys_ioctl+0x83/0xa0 do_syscall_64+0x56/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae Link: https://lkml.kernel.org/r/[email protected] Fixes: 1507f51255c9 ("mm: introduce memfd_secret system call to create "secret" memory areas") Signed-off-by: Sean Christopherson <[email protected]> Reported-by: Darrick J. Wong <[email protected]> Reported-by: Stephen <[email protected]> Tested-by: Darrick J. Wong <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Mike Rapoport <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-10-18elfcore: correct reference to CONFIG_UMLLukas Bulwahn1-1/+1
Commit 6e7b64b9dd6d ("elfcore: fix building with clang") introduces special handling for two architectures, ia64 and User Mode Linux. However, the wrong name, i.e., CONFIG_UM, for the intended Kconfig symbol for User-Mode Linux was used. Although the directory for User Mode Linux is ./arch/um; the Kconfig symbol for this architecture is called CONFIG_UML. Luckily, ./scripts/checkkconfigsymbols.py warns on non-existing configs: UM Referencing files: include/linux/elfcore.h Similar symbols: UML, NUMA Correct the name of the config to the intended one. [[email protected]: fix um/x86_64, per Catalin] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 6e7b64b9dd6d ("elfcore: fix building with clang") Signed-off-by: Lukas Bulwahn <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Nathan Chancellor <[email protected]> Cc: Nick Desaulniers <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Barret Rhoden <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-10-18mm/migrate: fix CPUHP state to update node demotion orderHuang Ying1-0/+4
The node demotion order needs to be updated during CPU hotplug. Because whether a NUMA node has CPU may influence the demotion order. The update function should be called during CPU online/offline after the node_states[N_CPU] has been updated. That is done in CPUHP_AP_ONLINE_DYN during CPU online and in CPUHP_MM_VMSTAT_DEAD during CPU offline. But in commit 884a6e5d1f93 ("mm/migrate: update node demotion order on hotplug events"), the function to update node demotion order is called in CPUHP_AP_ONLINE_DYN during CPU online/offline. This doesn't satisfy the order requirement. For example, there are 4 CPUs (P0, P1, P2, P3) in 2 sockets (P0, P1 in S0 and P2, P3 in S1), the demotion order is - S0 -> NUMA_NO_NODE - S1 -> NUMA_NO_NODE After P2 and P3 is offlined, because S1 has no CPU now, the demotion order should have been changed to - S0 -> S1 - S1 -> NO_NODE but it isn't changed, because the order updating callback for CPU hotplug doesn't see the new nodemask. After that, if P1 is offlined, the demotion order is changed to the expected order as above. So in this patch, we added CPUHP_AP_MM_DEMOTION_ONLINE and CPUHP_MM_DEMOTION_DEAD to be called after CPUHP_AP_ONLINE_DYN and CPUHP_MM_VMSTAT_DEAD during CPU online and offline, and register the update function on them. Link: https://lkml.kernel.org/r/[email protected] Fixes: 884a6e5d1f93 ("mm/migrate: update node demotion order on hotplug events") Signed-off-by: "Huang, Ying" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Yang Shi <[email protected]> Cc: Zi Yan <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Wei Xu <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: David Rientjes <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Keith Busch <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-10-18mm/migrate: add CPU hotplug to demotion #ifdefDave Hansen1-1/+4
Once upon a time, the node demotion updates were driven solely by memory hotplug events. But now, there are handlers for both CPU and memory hotplug. However, the #ifdef around the code checks only memory hotplug. A system that has HOTPLUG_CPU=y but MEMORY_HOTPLUG=n would miss CPU hotplug events. Update the #ifdef around the common code. Add memory and CPU-specific #ifdefs for their handlers. These memory/CPU #ifdefs avoid unused function warnings when their Kconfig option is off. [[email protected]: rework hotplug_memory_notifier() stub] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 884a6e5d1f93 ("mm/migrate: update node demotion order on hotplug events") Signed-off-by: Dave Hansen <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Wei Xu <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: David Rientjes <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-10-18tracing: Have all levels of checks prevent recursionSteven Rostedt (VMware)1-40/+9
While writing an email explaining the "bit = 0" logic for a discussion on making ftrace_test_recursion_trylock() disable preemption, I discovered a path that makes the "not do the logic if bit is zero" unsafe. The recursion logic is done in hot paths like the function tracer. Thus, any code executed causes noticeable overhead. Thus, tricks are done to try to limit the amount of code executed. This included the recursion testing logic. Having recursion testing is important, as there are many paths that can end up in an infinite recursion cycle when tracing every function in the kernel. Thus protection is needed to prevent that from happening. Because it is OK to recurse due to different running context levels (e.g. an interrupt preempts a trace, and then a trace occurs in the interrupt handler), a set of bits are used to know which context one is in (normal, softirq, irq and NMI). If a recursion occurs in the same level, it is prevented*. Then there are infrastructure levels of recursion as well. When more than one callback is attached to the same function to trace, it calls a loop function to iterate over all the callbacks. Both the callbacks and the loop function have recursion protection. The callbacks use the "ftrace_test_recursion_trylock()" which has a "function" set of context bits to test, and the loop function calls the internal trace_test_and_set_recursion() directly, with an "internal" set of bits. If an architecture does not implement all the features supported by ftrace then the callbacks are never called directly, and the loop function is called instead, which will implement the features of ftrace. Since both the loop function and the callbacks do recursion protection, it was seemed unnecessary to do it in both locations. Thus, a trick was made to have the internal set of recursion bits at a more significant bit location than the function bits. Then, if any of the higher bits were set, the logic of the function bits could be skipped, as any new recursion would first have to go through the loop function. This is true for architectures that do not support all the ftrace features, because all functions being traced must first go through the loop function before going to the callbacks. But this is not true for architectures that support all the ftrace features. That's because the loop function could be called due to two callbacks attached to the same function, but then a recursion function inside the callback could be called that does not share any other callback, and it will be called directly. i.e. traced_function_1: [ more than one callback tracing it ] call loop_func loop_func: trace_recursion set internal bit call callback callback: trace_recursion [ skipped because internal bit is set, return 0 ] call traced_function_2 traced_function_2: [ only traced by above callback ] call callback callback: trace_recursion [ skipped because internal bit is set, return 0 ] call traced_function_2 [ wash, rinse, repeat, BOOM! out of shampoo! ] Thus, the "bit == 0 skip" trick is not safe, unless the loop function is call for all functions. Since we want to encourage architectures to implement all ftrace features, having them slow down due to this extra logic may encourage the maintainers to update to the latest ftrace features. And because this logic is only safe for them, remove it completely. [*] There is on layer of recursion that is allowed, and that is to allow for the transition between interrupt context (normal -> softirq -> irq -> NMI), because a trace may occur before the context update is visible to the trace recursion logic. Link: https://lore.kernel.org/all/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Cc: Linus Torvalds <[email protected]> Cc: Petr Mladek <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Helge Deller <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Albert Ou <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Jiri Kosina <[email protected]> Cc: Miroslav Benes <[email protected]> Cc: Joe Lawrence <[email protected]> Cc: Colin Ian King <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: "Peter Zijlstra (Intel)" <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Jisheng Zhang <[email protected]> Cc: =?utf-8?b?546L6LSH?= <[email protected]> Cc: Guo Ren <[email protected]> Cc: [email protected] Fixes: edc15cafcbfa3 ("tracing: Avoid unnecessary multiple recursion checks") Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-10-18ucounts: Fix signal ucount refcountingEric W. Biederman1-0/+2
In commit fda31c50292a ("signal: avoid double atomic counter increments for user accounting") Linus made a clever optimization to how rlimits and the struct user_struct. Unfortunately that optimization does not work in the obvious way when moved to nested rlimits. The problem is that the last decrement of the per user namespace per user sigpending counter might also be the last decrement of the sigpending counter in the parent user namespace as well. Which means that simply freeing the leaf ucount in __free_sigqueue is not enough. Maintain the optimization and handle the tricky cases by introducing inc_rlimit_get_ucounts and dec_rlimit_put_ucounts. By moving the entire optimization into functions that perform all of the work it becomes possible to ensure that every level is handled properly. The new function inc_rlimit_get_ucounts returns 0 on failure to increment the ucount. This is different than inc_rlimit_ucounts which increments the ucounts and returns LONG_MAX if the ucount counter has exceeded it's maximum or it wrapped (to indicate the counter needs to decremented). I wish we had a single user to account all pending signals to across all of the threads of a process so this complexity was not necessary Cc: [email protected] Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts") v1: https://lkml.kernel.org/r/87mtnavszx.fsf_-_@disp2133 Link: https://lkml.kernel.org/r/87fssytizw.fsf_-_@disp2133 Reviewed-by: Alexey Gladkov <[email protected]> Tested-by: Rune Kleveland <[email protected]> Tested-by: Yu Zhao <[email protected]> Tested-by: Jordan Glover <[email protected]> Signed-off-by: "Eric W. Biederman" <[email protected]>
2021-10-18mctp: Be explicit about struct sockaddr_mctp paddingJeremy Kerr1-0/+2
We currently have some implicit padding in struct sockaddr_mctp. This patch makes this padding explicit, and ensures we have consistent layout on platforms with <32bit alignmnent. Fixes: 60fc63981693 ("mctp: Add sockaddr_mctp to uapi") Signed-off-by: Jeremy Kerr <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-10-18mctp: unify sockaddr_mctp typesJeremy Kerr2-3/+4
Use the more precise __kernel_sa_family_t for smctp_family, to match struct sockaddr. Also, use an unsigned int for the network member; negative networks don't make much sense. We're already using unsigned for mctp_dev and mctp_skb_cb, but need to change mctp_sock to suit. Fixes: 60fc63981693 ("mctp: Add sockaddr_mctp to uapi") Signed-off-by: Jeremy Kerr <[email protected]> Acked-by: Eugene Syromiatnikov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-10-17Merge tag 'block-5.15-2021-10-17' of git://git.kernel.dk/linux-blockLinus Torvalds2-10/+10
Pull block fixes from Jens Axboe: "Bigger than usual for this point in time, the majority is fixing some issues around BDI lifetimes with the move from the request_queue to the disk in this release. In detail: - Series on draining fs IO for del_gendisk() (Christoph) - NVMe pull request via Christoph: - fix the abort command id (Keith Busch) - nvme: fix per-namespace chardev deletion (Adam Manzanares) - brd locking scope fix (Tetsuo) - BFQ fix (Paolo)" * tag 'block-5.15-2021-10-17' of git://git.kernel.dk/linux-block: block, bfq: reset last_bfqq_created on group change block: warn when putting the final reference on a registered disk brd: reduce the brd_devices_mutex scope kyber: avoid q->disk dereferences in trace points block: keep q_usage_counter in atomic mode after del_gendisk block: drain file system I/O on del_gendisk block: split bio_queue_enter from blk_queue_enter block: factor out a blk_try_enter_queue helper block: call submit_bio_checks under q_usage_counter nvme: fix per-namespace chardev deletion block/rnbd-clt-sysfs: fix a couple uninitialized variable bugs nvme-pci: Fix abort command id
2021-10-17Merge tag 'char-misc-5.15-rc6' of ↵Linus Torvalds1-4/+2
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver fixes from Greg KH: "Here are some small char/misc driver fixes for 5.15-rc6 for reported issues that include: - habanalabs driver fixes - mei driver fixes and new ids - fpga new device ids - MAINTAINER file updates for fpga subsystem - spi module id table additions and fixes - fastrpc locking fixes - nvmem driver fix All of these have been in linux-next for a while with no reported issues" * tag 'char-misc-5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: eeprom: 93xx46: fix MODULE_DEVICE_TABLE nvmem: Fix shift-out-of-bound (UBSAN) with byte size cells mei: hbm: drop hbm responses on early shutdown mei: me: add Ice Lake-N device id. eeprom: 93xx46: Add SPI device ID table eeprom: at25: Add SPI ID table misc: HI6421V600_IRQ should depend on HAS_IOMEM misc: fastrpc: Add missing lock before accessing find_vma() cb710: avoid NULL pointer subtraction misc: gehc: Add SPI ID table MAINTAINERS: Drop outdated FPGA Manager website MAINTAINERS: Add Hao and Yilun as maintainers habanalabs: fix resetting args in wait for CS IOCTL fpga: ice40-spi: Add SPI device ID table
2021-10-15kyber: avoid q->disk dereferences in trace pointsChristoph Hellwig1-10/+9
q->disk becomes invalid after the gendisk is removed. Work around this by caching the dev_t for the tracepoints. The real fix would be to properly tear down the I/O schedulers with the gendisk, but that is a much more invasive change. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Tested-by: Yi Zhang <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2021-10-15block: drain file system I/O on del_gendiskChristoph Hellwig1-0/+1
Instead of delaying draining of file system I/O related items like the blk-qos queues, the integrity read workqueue and timeouts only when the request_queue is removed, do that when del_gendisk is called. This is important for SCSI where the upper level drivers that control the gendisk are separate entities, and the disk can be freed much earlier than the request_queue, or can even be unbound without tearing down the queue. Fixes: edb0872f44ec ("block: move the bdi from the request_queue to the gendisk") Reported-by: Ming Lei <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Tested-by: Darrick J. Wong <[email protected]> Link: https://lore.kernel.org/r/[email protected] Tested-by: Yi Zhang <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2021-10-15Merge tag 'spi-fix-v5.15-rc5' of ↵Linus Torvalds1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi Pull spi fixes from Mark Brown: "A few small fixes. Mostly driver specific but there's one in the core which fixes a deadlock when adding devices on spi-mux that's triggered because spi-mux is a SPI device which is itself a SPI controller and so can instantiate devices when registered. We were using a global lock to protect against reusing chip selects but they're a per controller thing so moving the lock per controller resolves that" * tag 'spi-fix-v5.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi: spi-mux: Fix false-positive lockdep splats spi: Fix deadlock when adding SPI controllers on SPI buses spi: bcm-qspi: clear MSPI spifie interrupt during probe spi: spi-nxp-fspi: don't depend on a specific node name erratum workaround spi: mediatek: skip delays if they are 0 spi: atmel: Fix PDC transfer setup bug spi: spidev: Add SPI ID table spi: Use 'flash' node name instead of 'spi-flash' in example
2021-10-15tcp: md5: Allow MD5SIG_FLAG_IFINDEX with ifindex=0Leonard Crestez1-2/+3
Multiple VRFs are generally meant to be "separate" but right now md5 keys for the default VRF also affect connections inside VRFs if the IP addresses happen to overlap. So far the combination of TCP_MD5SIG_FLAG_IFINDEX with tcpm_ifindex == 0 was an error, accept this to mean "key only applies to default VRF". This is what applications using VRFs for traffic separation want. Signed-off-by: Leonard Crestez <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-10-15sctp: fix transport encap_port update in sctp_vtag_verifyXin Long1-3/+3
transport encap_port update should be updated when sctp_vtag_verify() succeeds, namely, returns 1, not returns 0. Correct it in this patch. While at it, also fix the indentation. Fixes: a1dd2cf2f1ae ("sctp: allow changing transport encap_port by peer packets") Signed-off-by: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-10-14Merge tag 'net-5.15-rc6' of ↵Linus Torvalds6-80/+90
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Quite calm. The noisy DSA driver (embedded switches) changes, and adjustment to IPv6 IOAM behavior add to diffstat's bottom line but are not scary. Current release - regressions: - af_unix: rename UNIX-DGRAM to UNIX to maintain backwards compatibility - procfs: revert "add seq_puts() statement for dev_mcast", minor format change broke user space Current release - new code bugs: - dsa: fix bridge_num not getting cleared after ports leaving the bridge, resource leak - dsa: tag_dsa: send packets with TX fwd offload from VLAN-unaware bridges using VID 0, prevent packet drops if pvid is removed - dsa: mv88e6xxx: keep the pvid at 0 when VLAN-unaware, prevent HW getting confused about station to VLAN mapping Previous releases - regressions: - virtio-net: fix for skb_over_panic inside big mode - phy: do not shutdown PHYs in READY state - dsa: mv88e6xxx: don't use PHY_DETECT on internal PHY's, fix link LED staying lit after ifdown - mptcp: fix possible infinite wait on recvmsg(MSG_WAITALL) - mqprio: Correct stats in mqprio_dump_class_stats() - ice: fix deadlock for Tx timestamp tracking flush - stmmac: fix feature detection on old hardware Previous releases - always broken: - sctp: account stream padding length for reconf chunk - icmp: fix icmp_ext_echo_iio parsing in icmp_build_probe() - isdn: cpai: check ctr->cnr to avoid array index out of bound - isdn: mISDN: fix sleeping function called from invalid context - nfc: nci: fix potential UAF of rf_conn_info object - dsa: microchip: prevent ksz_mib_read_work from kicking back in after it's canceled in .remove and crashing - dsa: mv88e6xxx: isolate the ATU databases of standalone and bridged ports - dsa: sja1105, ocelot: break circular dependency between switch and tag drivers - dsa: felix: improve timestamping in presence of packe loss - mlxsw: thermal: fix out-of-bounds memory accesses Misc: - ipv6: ioam: move the check for undefined bits to improve interoperability" * tag 'net-5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (60 commits) icmp: fix icmp_ext_echo_iio parsing in icmp_build_probe MAINTAINERS: Update the devicetree documentation path of imx fec driver sctp: account stream padding length for reconf chunk mlxsw: thermal: Fix out-of-bounds memory accesses ethernet: s2io: fix setting mac address during resume NFC: digital: fix possible memory leak in digital_in_send_sdd_req() NFC: digital: fix possible memory leak in digital_tg_listen_mdaa() nfc: fix error handling of nfc_proto_register() Revert "net: procfs: add seq_puts() statement for dev_mcast" net: encx24j600: check error in devm_regmap_init_encx24j600 net: korina: select CRC32 net: arc: select CRC32 net: dsa: felix: break at first CPU port during init and teardown net: dsa: tag_ocelot_8021q: fix inability to inject STP BPDUs into BLOCKING ports net: dsa: felix: purge skb from TX timestamping queue if it cannot be sent net: dsa: tag_ocelot_8021q: break circular dependency with ocelot switch lib net: dsa: tag_ocelot: break circular dependency with ocelot switch lib driver net: mscc: ocelot: cross-check the sequence id from the timestamp FIFO with the skb PTP header net: mscc: ocelot: deny TX timestamping of non-PTP packets net: mscc: ocelot: warn when a PTP IRQ is raised for an unknown skb ...
2021-10-14Merge tag 'sound-5.15-rc6' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull sound fixes from Takashi Iwai: "This contains quite a few device-specific fixes for usual HD- and USB-audio in addition to a couple of ALSA core fixes (a UAF fix in sequencer and a fix for a misplaced PCM 32bit compat ioctl). Nothing really stands out" * tag 'sound-5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: ALSA: usb-audio: Add quirk for VF0770 ALSA: hda: avoid write to STATESTS if controller is in reset ALSA: hda/realtek: Fix the mic type detection issue for ASUS G551JW ALSA: pcm: Workaround for a wrong offset in SYNC_PTR compat ioctl ALSA: hda/realtek: Fix for quirk to enable speaker output on the Lenovo 13s Gen2 ALSA: hda: intel: Allow repeatedly probing on codec configuration errors ALSA: hda/realtek: Add quirk for TongFang PHxTxX1 ALSA: hda/realtek - ALC236 headset MIC recording issue ALSA: usb-audio: Enable rate validation for Scarlett devices ALSA: hda/realtek: Add quirk for Clevo X170KM-G ALSA: hda/realtek: Complete partial device name to avoid ambiguity ALSA: hda - Enable headphone mic on Dell Latitude laptops with ALC3254 ALSA: seq: Fix a potential UAF by wrong private_free call order ALSA: hda/realtek: Enable 4-speaker output for Dell Precision 5560 laptop ALSA: usb-audio: Fix a missing error check in scarlett gen2 mixer
2021-10-14spi: Fix deadlock when adding SPI controllers on SPI busesMark Brown1-0/+3
Currently we have a global spi_add_lock which we take when adding new devices so that we can check that we're not trying to reuse a chip select that's already controlled. This means that if the SPI device is itself a SPI controller and triggers the instantiation of further SPI devices we trigger a deadlock as we try to register and instantiate those devices while in the process of doing so for the parent controller and hence already holding the global spi_add_lock. Since we only care about concurrency within a single SPI bus move the lock to be per controller, avoiding the deadlock. This can be easily triggered in the case of spi-mux. Reported-by: Uwe Kleine-König <[email protected]> Signed-off-by: Mark Brown <[email protected]>
2021-10-13Merge tag 'mlx5-fixes-2021-10-12' of ↵Jakub Kicinski1-2/+8
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5 fixes 2021-10-12 * tag 'mlx5-fixes-2021-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux: net/mlx5e: Fix division by 0 in mlx5e_select_queue for representors net/mlx5e: Mutually exclude RX-FCS and RX-port-timestamp net/mlx5e: Switchdev representors are not vlan challenged net/mlx5e: Fix memory leak in mlx5_core_destroy_cq() error path net/mlx5e: Allow only complete TXQs partition in MQPRIO channel mode net/mlx5: Fix cleanup of bridge delayed work ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>