path: root/include
Age | Commit message | Author | Files | Lines
2022-03-22mm: sparsemem: move vmemmap related to HugeTLB to ↵Muchun Song1-0/+2
CONFIG_HUGETLB_PAGE_FREE_VMEMMAP The vmemmap_remap_free/alloc are relevant to HugeTLB, so move those functions to the scope of CONFIG_HUGETLB_PAGE_FREE_VMEMMAP. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Barry Song <[email protected]> Cc: Bodeddula Balasubramaniam <[email protected]> Cc: Chen Huang <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_keyMuchun Song2-8/+14
page_fixed_fake_head() is used throughout memory management, and its conditional check requires reading a global variable. Although the overhead of this check may be small, it increases when the memory cache comes under pressure. Also, the global variable will not be modified after system boot, so it is very appropriate to use the static key mechanism. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Barry Song <[email protected]> Cc: Bodeddula Balasubramaniam <[email protected]> Cc: Chen Huang <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
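A minimal sketch of the static-key conversion described above, as it could look in kernel code (the key name, the boot-parameter parsing, and the use of DEFINE_STATIC_KEY_FALSE rather than a Kconfig-driven default are illustrative assumptions, not necessarily the exact upstream definitions):

  #include <linux/jump_label.h>
  #include <linux/init.h>
  #include <linux/string.h>

  /* Decided once at boot; patched into the code, so no global-variable load. */
  DEFINE_STATIC_KEY_FALSE(hugetlb_free_vmemmap_enabled_key);

  static __always_inline bool hugetlb_free_vmemmap_enabled(void)
  {
          /* Compiles to a patched jump/no-op instead of reading a variable. */
          return static_branch_unlikely(&hugetlb_free_vmemmap_enabled_key);
  }

  static int __init early_hugetlb_free_vmemmap_param(char *buf)
  {
          if (buf && !strcmp(buf, "on"))
                  static_branch_enable(&hugetlb_free_vmemmap_enabled_key);
          return 0;
  }
  early_param("hugetlb_free_vmemmap", early_hugetlb_free_vmemmap_param);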
2022-03-22mm: hugetlb: free the 2nd vmemmap page associated with each HugeTLB pageMuchun Song1-4/+74
Patch series "Free the 2nd vmemmap page associated with each HugeTLB page", v7. This series can minimize the overhead of struct page for 2MB HugeTLB pages significantly. It further reduces the overhead of struct page by 12.5% for a 2MB HugeTLB compared to the previous approach, which means 2GB per 1TB HugeTLB. It is a nice gain. Comments and reviews are welcome. Thanks. The main implementation and details can refer to the commit log of patch 1. In this series, I have changed the following four helpers, the following table shows the impact of the overhead of those helpers. +------------------+-----------------------+ | APIs | head page | tail page | +------------------+-----------+-----------+ | PageHead() | Y | N | +------------------+-----------+-----------+ | PageTail() | Y | N | +------------------+-----------+-----------+ | PageCompound() | N | N | +------------------+-----------+-----------+ | compound_head() | Y | N | +------------------+-----------+-----------+ Y: Overhead is increased. N: Overhead is _NOT_ increased. It shows that the overhead of those helpers on a tail page don't change between "hugetlb_free_vmemmap=on" and "hugetlb_free_vmemmap=off". But the overhead on a head page will be increased when "hugetlb_free_vmemmap=on" (except PageCompound()). So I believe that Matthew Wilcox's folio series will help with this. The users of PageHead() and PageTail() are much less than compound_head() and most users of PageTail() are VM_BUG_ON(), so I have done some tests about the overhead of compound_head() on head pages. I have tested the overhead of calling compound_head() on a head page, which is 2.11ns (Measure the call time of 10 million times compound_head(), and then average). For a head page whose address is not aligned with PAGE_SIZE or a non-compound page, the overhead of compound_head() is 2.54ns which is increased by 20%. For a head page whose address is aligned with PAGE_SIZE, the overhead of compound_head() is 2.97ns which is increased by 40%. Most pages are the former. I do not think the overhead is significant since the overhead of compound_head() itself is low. This patch (of 5): This patch minimizes the overhead of struct page for 2MB HugeTLB pages significantly. It further reduces the overhead of struct page by 12.5% for a 2MB HugeTLB compared to the previous approach, which means 2GB per 1TB HugeTLB (2MB type). After the feature of "Free sonme vmemmap pages of HugeTLB page" is enabled, the mapping of the vmemmap addresses associated with a 2MB HugeTLB page becomes the figure below. HugeTLB struct pages(8 pages) page frame(8 pages) +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+---> PG_head | | | 0 | -------------> | 0 | | | +-----------+ +-----------+ | | | 1 | -------------> | 1 | | | +-----------+ +-----------+ | | | 2 | ----------------^ ^ ^ ^ ^ ^ | | +-----------+ | | | | | | | | 3 | ------------------+ | | | | | | +-----------+ | | | | | | | 4 | --------------------+ | | | | 2MB | +-----------+ | | | | | | 5 | ----------------------+ | | | | +-----------+ | | | | | 6 | ------------------------+ | | | +-----------+ | | | | 7 | --------------------------+ | | +-----------+ | | | | | | +-----------+ As we can see, the 2nd vmemmap page frame (indexed by 1) is reused and remaped. However, the 2nd vmemmap page frame is also can be freed to the buddy allocator, then we can change the mapping from the figure above to the figure below. 
HugeTLB struct pages(8 pages) page frame(8 pages) +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+---> PG_head | | | 0 | -------------> | 0 | | | +-----------+ +-----------+ | | | 1 | ---------------^ ^ ^ ^ ^ ^ ^ | | +-----------+ | | | | | | | | | 2 | -----------------+ | | | | | | | +-----------+ | | | | | | | | 3 | -------------------+ | | | | | | +-----------+ | | | | | | | 4 | ---------------------+ | | | | 2MB | +-----------+ | | | | | | 5 | -----------------------+ | | | | +-----------+ | | | | | 6 | -------------------------+ | | | +-----------+ | | | | 7 | ---------------------------+ | | +-----------+ | | | | | | +-----------+ After we do this, all tail vmemmap pages (1-7) are mapped to the head vmemmap page frame (0). In other words, there are more than one page struct with PG_head associated with each HugeTLB page. We __know__ that there is only one head page struct, the tail page structs with PG_head are fake head page structs. We need an approach to distinguish between those two different types of page structs so that compound_head(), PageHead() and PageTail() can work properly if the parameter is the tail page struct but with PG_head. The following code snippet describes how to distinguish between real and fake head page struct. if (test_bit(PG_head, &page->flags)) { unsigned long head = READ_ONCE(page[1].compound_head); if (head & 1) { if (head == (unsigned long)page + 1) ==> head page struct else ==> tail page struct } else ==> head page struct } We can safely access the field of the @page[1] with PG_head because the @page is a compound page composed with at least two contiguous pages. [[email protected]: restore lost comment changes] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Barry Song <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Michal Hocko <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Chen Huang <[email protected]> Cc: Bodeddula Balasubramaniam <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Qi Zheng <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
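Expressed as C, the pseudocode above maps to a helper along these lines (a sketch following the commit's description; the helper name and the hugetlb_free_vmemmap_enabled() guard come from this series, but the exact body may differ upstream):

  #include <linux/mm_types.h>
  #include <linux/page-flags.h>

  static __always_inline const struct page *page_fixed_fake_head(const struct page *page)
  {
          if (!hugetlb_free_vmemmap_enabled())
                  return page;

          if (test_bit(PG_head, &page->flags)) {
                  /*
                   * page[1] is safe to read: a page with PG_head is part of a
                   * compound page made of at least two contiguous struct pages.
                   */
                  unsigned long head = READ_ONCE(page[1].compound_head);

                  /* Bit 0 set: page[1] is a tail page, so "page" is a fake head. */
                  if (likely(head & 1))
                          return (const struct page *)(head - 1);
          }
          return page;
  }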
2022-03-22mm, fault-injection: declare should_fail_alloc_page()Vlastimil Babka1-0/+2
The mm/ directory can almost fully be built with W=1, which would help in local development. One remaining issue is missing prototype for should_fail_alloc_page(). Thus add it next to the should_failslab() prototype. Note the previous attempt by commit f7173090033c ("mm/page_alloc: make should_fail_alloc_page() static") had to be reverted by commit 54aa386661fe as it caused an unresolved symbol error with CONFIG_DEBUG_INFO_BTF=y Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
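The added declaration itself is tiny; roughly (header placement per the changelog, exact wording an assumption):

  /* include/linux/fault-inject.h, next to the should_failslab() prototype */
  bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order);

With the prototype visible, building mm/ with W=1 no longer warns about a missing prototype for the non-static definition in page_alloc.c.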
2022-03-22mm/memory-failure.c: fix race with changing page compound againMiaohe Lin2-0/+2
Patch series "A few fixup patches for memory failure", v2. This series contains a few patches to fix the race with changing page compound page, make non-LRU movable pages unhandlable and so on. More details can be found in the respective changelogs. There is a race window where we got the compound_head, the hugetlb page could be freed to buddy, or even changed to another compound page just before we try to get hwpoison page. Think about the below race window: CPU 1 CPU 2 memory_failure_hugetlb struct page *head = compound_head(p); hugetlb page might be freed to buddy, or even changed to another compound page. get_hwpoison_page -- page is not what we want now... If this race happens, just bail out. Also MF_MSG_DIFFERENT_PAGE_SIZE is introduced to record this event. [[email protected]: s@/**@/*@, per Naoya Horiguchi] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Cc: Tony Luck <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22arch/x86/mm/numa: Do not initialize nodes twiceOscar Salvador1-1/+0
On x86, prior to ("mm: handle uninitialized numa nodes gracefully"), NUMA nodes could be allocated at three different places.

 - numa_register_memblks
 - init_cpu_to_node
 - init_gi_nodes

All these calls happen at setup_arch, and have the following order:

 setup_arch
   ...
   x86_numa_init
     numa_init
       numa_register_memblks
   ...
   init_cpu_to_node
     init_memory_less_node
       alloc_node_data
       free_area_init_memoryless_node
   init_gi_nodes
     init_memory_less_node
       alloc_node_data
       free_area_init_memoryless_node

numa_register_memblks() is only interested in those nodes which have memory, so it skips over any memoryless node it finds. Later on, when we have read ACPI's SRAT table, we call init_cpu_to_node() and init_gi_nodes(), which initialize any memoryless node we might have that has either CPU or Initiator affinity, meaning we allocate pg_data_t struct for them and we mark them as ONLINE. So far so good, but the thing is that after ("mm: handle uninitialized numa nodes gracefully"), we allocate all possible NUMA nodes in free_area_init(), meaning we have a picture like the following:

 setup_arch
   x86_numa_init
     numa_init
       numa_register_memblks           <-- allocate non-memoryless node
   x86_init.paging.pagetable_init
     ...
     free_area_init
       free_area_init_memoryless       <-- allocate memoryless node
   init_cpu_to_node
     alloc_node_data                   <-- allocate memoryless node with CPU
     free_area_init_memoryless_node
   init_gi_nodes
     alloc_node_data                   <-- allocate memoryless node with Initiator
     free_area_init_memoryless_node

free_area_init() already allocates all possible NUMA nodes, but init_cpu_to_node() and init_gi_nodes() are clueless about that, so they go ahead and allocate a new pg_data_t struct without checking anything, meaning we end up allocating twice. It should be made clear that this only happens in the case where a memoryless NUMA node happens to have a CPU/Initiator affinity. So get rid of init_memory_less_node() and just set the node online. Note that setting the node online is needed, otherwise we choke down the chain when bringup_nonboot_cpus() ends up calling __try_online_node()->register_one_node()->... and we blow up in bus_add_device(). 
As can be seen here: BUG: kernel NULL pointer dereference, address: 0000000000000060 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.17.0-rc4-1-default+ #45 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/4 RIP: 0010:bus_add_device+0x5a/0x140 Code: 8b 74 24 20 48 89 df e8 84 96 ff ff 85 c0 89 c5 75 38 48 8b 53 50 48 85 d2 0f 84 bb 00 004 RSP: 0000:ffffc9000022bd10 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff888100987400 RCX: ffff8881003e4e19 RDX: ffff8881009a5e00 RSI: ffff888100987400 RDI: ffff888100987400 RBP: 0000000000000000 R08: ffff8881003e4e18 R09: ffff8881003e4c98 R10: 0000000000000000 R11: ffff888100402bc0 R12: ffffffff822ceba0 R13: 0000000000000000 R14: ffff888100987400 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88853fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000060 CR3: 000000000200a001 CR4: 00000000001706b0 Call Trace: device_add+0x4c0/0x910 __register_one_node+0x97/0x2d0 __try_online_node+0x85/0xc0 try_online_node+0x25/0x40 cpu_up+0x4f/0x100 bringup_nonboot_cpus+0x4f/0x60 smp_init+0x26/0x79 kernel_init_freeable+0x130/0x2f1 kernel_init+0x17/0x150 ret_from_fork+0x22/0x30 The reason is simple, by the time bringup_nonboot_cpus() gets called, we did not register the node_subsys bus yet, so we crash when bus_add_device() tries to dereference bus()->p. The following shows the order of the calls: kernel_init_freeable smp_init bringup_nonboot_cpus ... bus_add_device() <- we did not register node_subsys yet do_basic_setup do_initcalls postcore_initcall(register_node_type); register_node_type subsys_system_register subsys_register bus_register <- register node_subsys bus Why setting the node online saves us then? Well, simply because __try_online_node() backs off when the node is online, meaning we do not end up calling register_one_node() in the first place. This is subtle, broken and deserves a deep analysis and thought about how to put this into shape, but for now let us have this easy fix for the leaking memory issue. [[email protected]: add comments] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: da4490c958ad ("mm: handle uninitialized numa nodes gracefully") Signed-off-by: Oscar Salvador <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Rafael Aquini <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Wei Yang <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Alexey Makhalov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
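A sketch of what the fixed init_cpu_to_node()-style loop looks like once init_memory_less_node() is dropped (the loop shape is reconstructed from the changelog and is an assumption, not a verbatim copy of arch/x86/mm/numa.c):

  #include <linux/cpumask.h>
  #include <linux/nodemask.h>
  #include <linux/numa.h>

  static void __init init_cpu_to_node_sketch(void)
  {
          int cpu;

          for_each_possible_cpu(cpu) {
                  int node = numa_cpu_node(cpu);

                  if (node == NUMA_NO_NODE)
                          continue;
                  /*
                   * free_area_init() already allocated a pg_data_t for every
                   * possible node, so do not allocate it again here.  Marking
                   * the node online is enough, and it also makes
                   * __try_online_node() back off during CPU bringup.
                   */
                  if (!node_online(node))
                          node_set_online(node);

                  numa_set_node(cpu, node);
          }
  }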
2022-03-22mm: enforce pageblock_order < MAX_ORDERDavid Hildenbrand2-4/+6
Some places in the kernel don't really expect pageblock_order >= MAX_ORDER, and it looks like this is only possible in corner cases: 1) CONFIG_DEFERRED_STRUCT_PAGE_INIT we'll end up freeing pageblock_order pages via __free_pages_core(), which cannot possibly work. 2) find_zone_movable_pfns_for_nodes() will roundup the ZONE_MOVABLE start PFN to MAX_ORDER_NR_PAGES. Consequently with a bigger pageblock_order, we could have a single pageblock partially managed by two zones. 3) compaction code runs into __fragmentation_index() with order >= MAX_ORDER, when checking WARN_ON_ONCE(order >= MAX_ORDER). [1] 4) mm/page_reporting.c won't be reporting any pages with default page_reporting_order == pageblock_order, as we'll be skipping the reporting loop inside page_reporting_process_zone(). 5) __rmqueue_fallback() will never be able to steal with ALLOC_NOFRAGMENT. pageblock_order >= MAX_ORDER is weird either way: it's a pure optimization for making alloc_contig_range(), as used for allcoation of gigantic pages, a little more reliable to succeed. However, if there is demand for somewhat reliable allocation of gigantic pages, affected setups should be using CMA or boottime allocations instead. So let's make sure that pageblock_order < MAX_ORDER and simplify. [1] https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Zi Yan <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Frank Rowand <[email protected]> Cc: John Garry via iommu <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michael S. Tsirkin <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Rob Herring <[email protected]> Cc: Robin Murphy <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22cma: factor out minimum alignment requirementDavid Hildenbrand1-0/+9
Patch series "mm: enforce pageblock_order < MAX_ORDER". Having pageblock_order >= MAX_ORDER seems to be able to happen in corner cases and some parts of the kernel are not prepared for it. For example, Aneesh has shown [1] that such kernels can be compiled on ppc64 with 64k base pages by setting FORCE_MAX_ZONEORDER=8, which will run into a WARN_ON_ONCE(order >= MAX_ORDER) in comapction code right during boot. We can get pageblock_order >= MAX_ORDER when the default hugetlb size is bigger than the maximum allocation granularity of the buddy, in which case we are no longer talking about huge pages but instead gigantic pages. Having pageblock_order >= MAX_ORDER can only make alloc_contig_range() of such gigantic pages more likely to succeed. Reliable use of gigantic pages either requires boot time allcoation or CMA, no need to overcomplicate some places in the kernel to optimize for corner cases that are broken in other areas of the kernel. This patch (of 2): Let's enforce pageblock_order < MAX_ORDER and simplify. Especially patch #1 can be regarded a cleanup before: [PATCH v5 0/6] Use pageblock_order for cma and alloc_contig_range alignment. [2] [1] https://lkml.kernel.org/r/[email protected] [2] https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Zi Yan <[email protected]> Acked-by: Rob Herring <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Frank Rowand <[email protected]> Cc: Michael S. Tsirkin <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: Robin Murphy <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: John Garry via iommu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/mmzone.h: remove unused macrosMiaohe Lin1-7/+0
Remove pgdat_page_nr, nid_page_nr and NODE_MEM_MAP. They are unused now. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: page_alloc: avoid merging non-fallbackable pageblocks with othersZi Yan1-0/+11
This is done in addition to MIGRATE_ISOLATE pageblock merge avoidance. It prepares for the upcoming removal of the MAX_ORDER-1 alignment requirement for CMA and alloc_contig_range(). MIGRATE_HIGHATOMIC should not merge with other migratetypes like MIGRATE_ISOLATE and MIGRATE_CMA[1], so this commit prevents that too. Remove MIGRATE_CMA and MIGRATE_ISOLATE from the fallbacks list, since they are never used. [1] https://lore.kernel.org/linux-mm/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Zi Yan <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: David Hildenbrand <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Mike Rapoport <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
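The check boils down to a small helper; a sketch of the idea, relying on the migratetype enum ordering where everything from MIGRATE_HIGHATOMIC upwards is excluded:

  static inline bool migratetype_is_mergeable(int mt)
  {
          /*
           * MIGRATE_HIGHATOMIC, MIGRATE_CMA and MIGRATE_ISOLATE pageblocks
           * must not be merged with, or stolen by, other migratetypes.
           */
          return mt < MIGRATE_PCPTYPES;
  }

The buddy merge and fallback-stealing paths can then test migratetype_is_mergeable() before combining or converting a pageblock.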
2022-03-22mm/vmalloc: fix comments about vmap_area structBang Li1-2/+2
The vmap_area_root should be in the "busy" tree and the free_vmap_area_root should be in the "free" tree. Link: https://lkml.kernel.org/r/[email protected] Fixes: 688fcbfc06e4 ("mm/vmalloc: modify struct vmap_area to reduce its size") Signed-off-by: Bang Li <[email protected]> Reviewed-by: Uladzislau Rezki (Sony) <[email protected]> Cc: Pengfei Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: merge pte_mkhuge() call into arch_make_huge_pte()Anshuman Khandual1-1/+1
Each call into pte_mkhuge() is invariably followed by arch_make_huge_pte(). Instead arch_make_huge_pte() can accommodate pte_mkhuge() at the beginning. This updates generic fallback stub for arch_make_huge_pte() and available platforms definitions. This makes huge pte creation much cleaner and easier to follow. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Anshuman Khandual <[email protected]> Reviewed-by: Christophe Leroy <[email protected]> Acked-by: Mike Kravetz <[email protected]> Acked-by: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
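The generic fallback then carries the pte_mkhuge() itself; a sketch of the stub (architectures with their own arch_make_huge_pte() override it):

  #ifndef arch_make_huge_pte
  static inline pte_t arch_make_huge_pte(pte_t entry, unsigned int shift,
                                         vm_flags_t flags)
  {
          /* pte_mkhuge() is folded in here instead of at every call site. */
          return pte_mkhuge(entry);
  }
  #endif

Callers then simply write entry = arch_make_huge_pte(entry, shift, vma->vm_flags); with no separate pte_mkhuge() call beforehand.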
2022-03-22mm: memcontrol: rename memcg_cache_id to memcg_kmem_idMuchun Song1-2/+2
memcg_cache_id(), introduced by commit 2633d7a02823 ("slab/slub: consider a memcg parameter in kmem_create_cache"), is used to index into the kmem_cache->memcg_params->memcg_caches array. Since kmem_cache->memcg_params.memcg_caches has been removed by commit 9855609bde03 ("mm: memcg/slab: use a single set of kmem_caches for all accounted allocations"), the name no longer needs to be cache related. Rename it to memcg_kmem_id, which better reflects that it is kmem related. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Cc: Alex Shi <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kari Argillander <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
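After the rename, the helper is just a thin accessor; a sketch, assuming the kmemcg_id field keeps its current name:

  static inline int memcg_kmem_id(struct mem_cgroup *memcg)
  {
          return memcg ? memcg->kmemcg_id : -1;
  }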
2022-03-22mm: list_lru: rename list_lru_per_memcg to list_lru_memcgMuchun Song1-1/+1
The name list_lru_memcg was occupied before and became free as of the last commit. Rename list_lru_per_memcg to list_lru_memcg since that name is more concise. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Cc: Alex Shi <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kari Argillander <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: list_lru: replace linear array with xarrayMuchun Song2-34/+2
If we run 10k containers in the system, the size of the list_lru_memcg->lrus can be ~96KB per list_lru. When we decrease the number of containers, the size of the array will not be shrunk. It is not scalable. The xarray is a good choice for this case. We can save a lot of memory when there are tens of thousands of containers in the system. If we use an xarray, we can also remove the array-resizing logic, which simplifies the code. [[email protected]: remove unused local] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Cc: Alex Shi <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kari Argillander <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
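With the xarray in place, the per-memcg lookup becomes a plain xa_load(); a sketch of the lookup path (field and struct names follow this series and are assumptions):

  static inline struct list_lru_one *
  list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
  {
          if (idx >= 0) {
                  /* Entries only exist once a memcg has actually inserted here. */
                  struct list_lru_memcg *mlru = xa_load(&lru->xa, idx);

                  return mlru ? &mlru->node[nid] : NULL;
          }
          return &lru->node[nid].lru;
  }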
2022-03-22mm: list_lru: rename memcg_drain_all_list_lrus to memcg_reparent_list_lrusMuchun Song1-1/+1
The purpose of memcg_drain_all_list_lrus() is list_lru reparenting. It is very similar to memcg_reparent_objcgs(). Rename it to memcg_reparent_list_lrus() so that the name is more consistent with memcg_reparent_objcgs(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Cc: Alex Shi <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kari Argillander <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: list_lru: allocate list_lru_one only when neededMuchun Song1-3/+4
In our server, we found a suspected memory leak problem. The kmalloc-32 consumes more than 6GB of memory. Other kmem_caches consume less than 2GB memory. After our in-depth analysis, the memory consumption of kmalloc-32 slab cache is the cause of list_lru_one allocation. crash> p memcg_nr_cache_ids memcg_nr_cache_ids = $2 = 24574 memcg_nr_cache_ids is very large and memory consumption of each list_lru can be calculated with the following formula. num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32) There are 4 numa nodes in our system, so each list_lru consumes ~3MB. crash> list super_blocks | wc -l 952 Every mount will register 2 list lrus, one is for inode, another is for dentry. There are 952 super_blocks. So the total memory is 952 * 2 * 3 MB (~5.6GB). But the number of memory cgroup is less than 500. So I guess more than 12286 containers have been deployed on this machine (I do not know why there are so many containers, it may be a user's bug or the user really want to do that). And memcg_nr_cache_ids has not been reduced to a suitable value. This can waste a lot of memory. Now the infrastructure for dynamic list_lru_one allocation is ready, so remove statically allocated memory code to save memory. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Cc: Alex Shi <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kari Argillander <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22xarray: use kmem_cache_alloc_lru to allocate xa_nodeMuchun Song2-2/+12
The workingset will add the xa_node to the shadow_nodes list. So the allocation of xa_node should be done by kmem_cache_alloc_lru(). Using xas_set_lru() to pass the list_lru which we want to insert xa_node into to set up the xa_node reclaim context correctly. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Roman Gushchin <[email protected]> Cc: Alex Shi <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Kari Argillander <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22fs: introduce alloc_inode_sb() to allocate filesystems specific inodeMuchun Song1-0/+11
The allocated inode cache is supposed to be added to its memcg list_lru, which should be allocated as well in advance. That can be done by kmem_cache_alloc_lru(), which allocates the object and the list_lru. File systems are the main users of it, so introduce alloc_inode_sb() to allocate file-system-specific inodes and set up the inode reclaim context properly. File systems are supposed to use alloc_inode_sb() to allocate inodes. In later patches, we will convert all users to the new API. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Roman Gushchin <[email protected]> Cc: Alex Shi <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kari Argillander <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
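The helper itself is a one-liner wrapping the new slab entry point; a sketch matching the description above, using the superblock's existing s_inode_lru list_lru:

  static inline void *alloc_inode_sb(struct super_block *sb,
                                     struct kmem_cache *cache, gfp_t gfp)
  {
          /* Allocate the inode and, if needed, its memcg-aware list_lru state. */
          return kmem_cache_alloc_lru(cache, &sb->s_inode_lru, gfp);
  }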
2022-03-22mm: introduce kmem_cache_alloc_lruMuchun Song3-0/+21
We currently allocate scope for every memcg to be tracked on every superblock instantiated in the system, regardless of whether that superblock is even accessible to that memcg. These huge memcg counts come from container hosts where memcgs are confined to just a small subset of the total number of superblocks that are instantiated at any given point in time. For these systems with huge container counts, list_lru does not need the capability of tracking every memcg on every superblock. What it comes down to is adding the memcg to the list_lru at the first insert. So introduce kmem_cache_alloc_lru to allocate objects and their list_lru. In later patches, we will convert all inode and dentry allocations from kmem_cache_alloc to kmem_cache_alloc_lru. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Cc: Alex Shi <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kari Argillander <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
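A hypothetical filesystem conversion illustrates the intended usage (the foo_* names are made up for the example; the real conversions land in later patches of the series):

  struct foo_inode_info {
          struct inode vfs_inode;                 /* illustrative in-memory inode */
  };

  static struct kmem_cache *foo_inode_cachep;     /* illustrative inode cache */

  static struct inode *foo_alloc_inode(struct super_block *sb)
  {
          struct foo_inode_info *fi;

          /* Was: fi = kmem_cache_alloc(foo_inode_cachep, GFP_KERNEL); */
          fi = kmem_cache_alloc_lru(foo_inode_cachep, &sb->s_inode_lru,
                                    GFP_KERNEL);
          if (!fi)
                  return NULL;
          return &fi->vfs_inode;
  }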
2022-03-22mm: list_lru: transpose the array of per-node per-memcg lru listsMuchun Song1-7/+10
Patch series "Optimize list lru memory consumption", v6. In our server, we found a suspected memory leak problem. The kmalloc-32 consumes more than 6GB of memory. Other kmem_caches consume less than 2GB memory. After our in-depth analysis, the memory consumption of kmalloc-32 slab cache is the cause of list_lru_one allocation. crash> p memcg_nr_cache_ids memcg_nr_cache_ids = $2 = 24574 memcg_nr_cache_ids is very large and memory consumption of each list_lru can be calculated with the following formula. num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32) There are 4 numa nodes in our system, so each list_lru consumes ~3MB. crash> list super_blocks | wc -l 952 Every mount will register 2 list lrus, one is for inode, another is for dentry. There are 952 super_blocks. So the total memory is 952 * 2 * 3 MB (~5.6GB). But now the number of memory cgroups is less than 500. So I guess more than 12286 memory cgroups have been created on this machine (I do not know why there are so many cgroups, it may be a user's bug or the user really want to do that). Because memcg_nr_cache_ids has not been reduced to a suitable value. It leads to waste a lot of memory. If we want to reduce memcg_nr_cache_ids, we have to *reboot* the server. This is not what we want. In order to reduce memcg_nr_cache_ids, I had posted a patchset [1] to do this. But this did not fundamentally solve the problem. We currently allocate scope for every memcg to be able to tracked on every superblock instantiated in the system, regardless of whether that superblock is even accessible to that memcg. These huge memcg counts come from container hosts where memcgs are confined to just a small subset of the total number of superblocks that instantiated at any given point in time. For these systems with huge container counts, list_lru does not need the capability of tracking every memcg on every superblock. What it comes down to is that the list_lru is only needed for a given memcg if that memcg is instatiating and freeing objects on a given list_lru. As Dave said, "Which makes me think we should be moving more towards 'add the memcg to the list_lru at the first insert' model rather than 'instantiate all at memcg init time just in case'." This patchset aims to optimize the list lru memory consumption from different aspects. I had done a easy test to show the optimization. I create 10k memory cgroups and mount 10k filesystems in the systems. We use free command to show how many memory does the systems comsumes after this operation (There are 2 numa nodes in the system). +-----------------------+------------------------+ | condition | memory consumption | +-----------------------+------------------------+ | without this patchset | 24464 MB | +-----------------------+------------------------+ | after patch 1 | 21957 MB | <--------+ +-----------------------+------------------------+ | | after patch 10 | 6895 MB | | +-----------------------+------------------------+ | | after patch 12 | 4367 MB | | +-----------------------+------------------------+ | | The more the number of nodes, the more obvious the effect---+ BTW, there was a recent discussion [2] on the same issue. [1] https://lore.kernel.org/all/[email protected]/ [2] https://lore.kernel.org/all/[email protected]/ This series not only optimizes the memory usage of list_lru but also simplifies the code. 
This patch (of 16): The current scheme of maintaining per-node per-memcg lru lists looks like: struct list_lru { struct list_lru_node *node; (for each node) struct list_lru_memcg *memcg_lrus; struct list_lru_one *lru[]; (for each memcg) } By effectively transposing the two-dimension array of list_lru_one's structures (per-node per-memcg => per-memcg per-node) it's possible to save some memory and simplify alloc/dealloc paths. The new scheme looks like: struct list_lru { struct list_lru_memcg *mlrus; struct list_lru_per_memcg *mlru[]; (for each memcg) struct list_lru_one node[0]; (for each node) } Memory savings are coming from not only 'struct rcu_head' but also some pointer arrays used to store the pointer to 'struct list_lru_one'. The array is per node and its size is 8 (a pointer) * num_memcgs. So the total size of the arrays is 8 * num_nodes * memcg_nr_cache_ids. After this patch, the size becomes 8 * memcg_nr_cache_ids. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Yang Shi <[email protected]> Cc: Alex Shi <[email protected]> Cc: Wei Yang <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Chao Yu <[email protected]> Cc: Kari Argillander <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Theodore Ts'o <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/memcg: retrieve parent memcg from css.parentWei Yang1-3/+1
The parent we get from page_counter is correct, but these are two different hierarchies. Let's retrieve the parent memcg from css.parent just like parent_cs(), blkcg_parent(), etc. do. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Wei Yang <[email protected]> Reviewed-by: Muchun Song <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Roman Gushchin <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Yang Shi <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
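The resulting helper simply walks the cgroup hierarchy; roughly:

  static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
  {
          /* css.parent is the cgroup hierarchy, not the page_counter chain. */
          return mem_cgroup_from_css(memcg->css.parent);
  }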
2022-03-22memcg: add per-memcg total kernel memory statYosry Ahmed1-0/+1
Currently memcg stats show several types of kernel memory: kernel stack, page tables, sock, vmalloc, and slab. However, there are other allocations with __GFP_ACCOUNT (or supersets such as GFP_KERNEL_ACCOUNT) that are not accounted in any of those stats, a few examples are: - various kvm allocations (e.g. allocated pages to create vcpus) - io_uring - tmp_page in pipes during pipe_write() - bpf ringbuffers - unix sockets Keeping track of the total kernel memory is essential for the ease of migration from cgroup v1 to v2 as there are large discrepancies between v1's kmem.usage_in_bytes and the sum of the available kernel memory stats in v2. Adding separate memcg stats for all __GFP_ACCOUNT kernel allocations is an impractical maintenance burden as there a lot of those all over the kernel code, with more use cases likely to show up in the future. Therefore, add a "kernel" memcg stat that is analogous to kmem page counter, with added benefits such as using rstat infrastructure which aggregates stats more efficiently. Additionally, this provides a lighter alternative in case the legacy kmem is deprecated in the future [[email protected]: v2] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yosry Ahmed <[email protected]> Acked-by: Shakeel Butt <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22tmpfs: support for file creation timeXavier Roche1-0/+1
Various filesystems (including ext4) now support file creation time. This adds such support for tmpfs-based filesystems. Note that using shmem_getattr() on other file types than regular requires that shmem_is_huge() check type, to stop incorrect HPAGE_PMD_SIZE blksize. [[email protected]: three tweaks to creation time patch] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/20220314211150.GA123458@xavier-xps Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/20220211213628.GA1919658@xavier-xps Signed-off-by: Xavier Roche <[email protected]> Signed-off-by: Hugh Dickins <[email protected]> Tested-by: Jean Delvare <[email protected]> Tested-by: Sylvain Bellone <[email protected]> Reported-by: Xavier Grand <[email protected]> Reviewed-by: Jean Delvare <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/gup: remove unused get_user_pages_locked()John Hubbard1-2/+0
Now that the last caller of get_user_pages_locked() is gone, remove it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: John Hubbard <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Reviewed-by: Claudio Imbrenda <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Alex Williamson <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lukas Bulwahn <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm/gup: remove unused pin_user_pages_locked()John Hubbard1-2/+0
This routine was used for a short while, but then the calling code was refactored and the only caller was removed. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: John Hubbard <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Claudio Imbrenda <[email protected]> Cc: Alex Williamson <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lukas Bulwahn <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22filemap: remove find_get_pages()Miaohe Lin1-7/+0
It's unused now. Remove it and clean up the relevant comment. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: David Howells <[email protected]> Cc: William Kucharski <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Andreas Gruenbacher <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mount: warn only once about timestamp range expirationAnthony Iliopoulos1-0/+1
Commit f8b92ba67c5d ("mount: Add mount warning for impending timestamp expiry") introduced a mount warning regarding filesystem timestamp limits, that is printed upon each writable mount or remount. This can result in a lot of unnecessary messages in the kernel log in setups where filesystems are being frequently remounted (or mounted multiple times). Avoid this by setting a superblock flag which indicates that the warning has been emitted at least once for any particular mount, as suggested in [1]. Link: https://lore.kernel.org/CAHk-=wim6VGnxQmjfK_tDg6fbHYKL4EFkmnTjVr9QnRqjDBAeA@mail.gmail.com/ [1] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Anthony Iliopoulos <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Acked-by: Christian Brauner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Deepa Dinamani <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
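A sketch of the warn-once logic (the flag name follows the changelog's description of a superblock flag; the bit value, the helper name, and its placement are illustrative):

  /* include/linux/fs.h */
  #define SB_I_TS_EXPIRY_WARNED 0x00000400  /* timestamp range warning issued */

  /* fs/namespace.c, in the mount/remount warning path */
  static void warn_timestamp_expiry_once(struct super_block *sb)
  {
          if (sb->s_iflags & SB_I_TS_EXPIRY_WARNED)
                  return;                   /* already warned for this superblock */

          /* ... existing pr_warn() about sb->s_time_max goes here ... */
          sb->s_iflags |= SB_I_TS_EXPIRY_WARNED;
  }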
2022-03-22remove congestion tracking frameworkNeilBrown3-38/+0
This framework is no longer used - so discard it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: NeilBrown <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Darrick J. Wong <[email protected]> Cc: Ilya Dryomov <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jeff Layton <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Lars Ellenberg <[email protected]> Cc: Miklos Szeredi <[email protected]> Cc: Paolo Valente <[email protected]> Cc: Philipp Reisner <[email protected]> Cc: Ryusuke Konishi <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Wu Fengguang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22remove bdi_congested() and wb_congested() and related functionsNeilBrown1-26/+0
These functions are no longer useful as no BDIs report congestion any more. Removing the test on bdi_write_congested() in current_may_throttle() could cause a small change in behaviour, but only when PF_LOCAL_THROTTLE is set. So replace the calls by 'false' and simplify the code - and remove the functions. [[email protected]: fix build] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: NeilBrown <[email protected]> Acked-by: Ryusuke Konishi <[email protected]> [nilfs] Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Darrick J. Wong <[email protected]> Cc: Ilya Dryomov <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jeff Layton <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Lars Ellenberg <[email protected]> Cc: Miklos Szeredi <[email protected]> Cc: Paolo Valente <[email protected]> Cc: Philipp Reisner <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Wu Fengguang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22remove inode_congested()NeilBrown1-22/+0
inode_congested() reports if the backing-device for the inode is congested. No bdi reports congestion any more, so this always returns 'false'. So remove inode_congested() and related functions, and remove the call sites, assuming that inode_congested() always returns 'false'. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: NeilBrown <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Darrick J. Wong <[email protected]> Cc: Ilya Dryomov <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jeff Layton <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Lars Ellenberg <[email protected]> Cc: Miklos Szeredi <[email protected]> Cc: Paolo Valente <[email protected]> Cc: Philipp Reisner <[email protected]> Cc: Ryusuke Konishi <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Wu Fengguang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22nfs: remove reliance on bdi congestionNeilBrown1-0/+1
The bdi congestion tracking is not widely used and will be removed. NFS is one of a small number of filesystems that uses it, setting just the async (write) congestion flag at what it determines are appropriate times. The only remaining effect of the async flag is to cause (some) WB_SYNC_NONE writes to be skipped. So instead of setting the flag, set an internal flag and change:

 - .writepages to do nothing if WB_SYNC_NONE and the flag is set
 - .writepage to return AOP_WRITEPAGE_ACTIVATE if WB_SYNC_NONE and the flag is set.

The writepages change causes a behavioural change in that pageout() can now return PAGE_ACTIVATE instead of PAGE_KEEP, so SetPageActive() will be called on the page which (I think) will further delay the next attempt at writeout. This might be a good thing. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: NeilBrown <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Darrick J. Wong <[email protected]> Cc: Ilya Dryomov <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jeff Layton <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Lars Ellenberg <[email protected]> Cc: Miklos Szeredi <[email protected]> Cc: Paolo Valente <[email protected]> Cc: Philipp Reisner <[email protected]> Cc: Ryusuke Konishi <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Wu Fengguang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22mm: document and polish read-ahead codeNeilBrown1-2/+7
Add some "big-picture" documentation for read-ahead and polish the code to make it fit this documentation. The meaning of ->async_size is clarified to match its name. i.e. Any request to ->readahead() has a sync part and an async part. The caller will wait for the sync pages to complete, but will not wait for the async pages. The first async page is still marked PG_readahead Note that the current function names page_cache_sync_ra() and page_cache_async_ra() are misleading. All ra request are partly sync and partly async, so either part can be empty. A page_cache_sync_ra() request will usually set ->async_size non-zero, implying it is not all synchronous. When a non-zero req_count is passed to page_cache_async_ra(), the implication is that some prefix of the request is synchronous, though the calculation made there is incorrect - I haven't tried to fix it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: NeilBrown <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Chao Yu <[email protected]> Cc: Darrick J. Wong <[email protected]> Cc: Ilya Dryomov <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jeff Layton <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Lars Ellenberg <[email protected]> Cc: Miklos Szeredi <[email protected]> Cc: Paolo Valente <[email protected]> Cc: Philipp Reisner <[email protected]> Cc: Ryusuke Konishi <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Wu Fengguang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22doc: convert 'subsection' to 'section' in gfp.hNeilBrown1-5/+5
Patch series "Remove remaining parts of congestion tracking code", v2. This patch (of 11): Various DOC: sections in gfp.h have subsection headers (~~~) but the place where they are included in mm-api.rst does not have section, only chapters. So convert to section headers (---) to avoid confusion. Specifically if sections are added later in mm-api.rst, an error results. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: NeilBrown <[email protected]> Cc: Jan Kara <[email protected]> Cc: Wu Fengguang <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Chao Yu <[email protected]> Cc: Jeff Layton <[email protected]> Cc: Ilya Dryomov <[email protected]> Cc: Miklos Szeredi <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Ryusuke Konishi <[email protected]> Cc: Darrick J. Wong <[email protected]> Cc: Philipp Reisner <[email protected]> Cc: Lars Ellenberg <[email protected]> Cc: Paolo Valente <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-22linux/kthread.h: remove unused macrosRasmus Villemoes1-22/+0
Ever since these macros were introduced in commit b56c0d8937e6 ("kthread: implement kthread_worker"), there has been precisely one user (commit 4d115420707a, "NVMe: Async IO queue deletion"), and that user went away in 2016 with db3cbfff5bcc ("NVMe: IO queue deletion re-write"). Apart from being unused, these macros are also awkward to use (which may contribute to them not being used): Having a way to statically (or on-stack) allocating the storage for the struct kthread_worker itself doesn't help much, since obviously one needs to have some code for actually _spawning_ the worker thread, which must have error checking. And these days we have the kthread_create_worker() interface which both allocates the struct kthread_worker and spawns the kthread. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Rasmus Villemoes <[email protected]> Acked-by: Tejun Heo <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Petr Mladek <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Yafang Shao <[email protected]> Cc: Cai Huoqing <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-17Merge tag 'net-5.17-final' of ↵Linus Torvalds3-2/+3
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from netfilter, ipsec, and wireless. A few last minute revert / disable and fix patches came down from our sub-trees. We're not waiting for any fixes at this point. Current release - regressions: - Revert "netfilter: nat: force port remap to prevent shadowing well-known ports", restore working conntrack on asymmetric paths - Revert "ath10k: drop beacon and probe response which leak from other channel", restore working AP and mesh mode on QCA9984 - eth: intel: fix hang during reboot/shutdown Current release - new code bugs: - netfilter: nf_tables: disable register tracking, it needs more work to cover all corner cases Previous releases - regressions: - ipv6: fix skb_over_panic in __ip6_append_data when (admin-only) extension headers get specified - esp6: fix ESP over TCP/UDP, interpret ipv6_skip_exthdr's return value more selectively - bnx2x: fix driver load failure when FW not present in initrd Previous releases - always broken: - vsock: stop destroying unrelated sockets in nested virtualization - packet: fix slab-out-of-bounds access in packet_recvmsg() Misc: - add Paolo Abeni to networking maintainers!" * tag 'net-5.17-final' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (26 commits) iavf: Fix hang during reboot/shutdown net: mscc: ocelot: fix backwards compatibility with single-chain tc-flower offload net: bcmgenet: skip invalid partial checksums bnx2x: fix built-in kernel driver load failure net: phy: mscc: Add MODULE_FIRMWARE macros net: dsa: Add missing of_node_put() in dsa_port_parse_of net: handle ARPHRD_PIMREG in dev_is_mac_header_xmit() Revert "ath10k: drop beacon and probe response which leak from other channel" hv_netvsc: Add check for kvmalloc_array iavf: Fix double free in iavf_reset_task ice: destroy flow director filter mutex after releasing VSIs ice: fix NULL pointer dereference in ice_update_vsi_tx_ring_stats() Add Paolo Abeni to networking maintainers atm: eni: Add check for dma_map_single net/packet: fix slab-out-of-bounds access in packet_recvmsg() net: mdio: mscc-miim: fix duplicate debugfs entry net: phy: marvell: Fix invalid comparison in the resume and suspend functions esp6: fix check on ipv6_skip_exthdr's return value net: dsa: microchip: add spi_device_id tables netfilter: nf_tables: disable register tracking ...
2022-03-16net: handle ARPHRD_PIMREG in dev_is_mac_header_xmit()Nicolas Dichtel1-0/+1
This kind of interface doesn't have a mac header. This patch fixes bpf_redirect() to a PIM interface. Fixes: 27b29f63058d ("bpf: add bpf_redirect() helper") Signed-off-by: Nicolas Dichtel <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
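For context, the helper in include/linux/if_arp.h looks roughly like the sketch below with the new case added (the existing case list is reproduced from memory and may be abridged, so treat it as a sketch rather than the exact diff):

    static inline bool dev_is_mac_header_xmit(const struct net_device *dev)
    {
            switch (dev->type) {
            case ARPHRD_TUNNEL:
            case ARPHRD_TUNNEL6:
            case ARPHRD_SIT:
            case ARPHRD_IPGRE:
            case ARPHRD_IP6GRE:
            case ARPHRD_VOID:
            case ARPHRD_NONE:
            case ARPHRD_RAWIP:
            case ARPHRD_PIMREG:     /* added: PIM register devices have no mac header */
                    return false;
            default:
                    return true;
            }
    }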
2022-03-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nfJakub Kicinski1-1/+0
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net coming late in the 5.17-rc process: 1) Revert port remap to mitigate shadowing of service ports; this is causing problems in existing setups and the mitigation can be achieved with an explicit ruleset, e.g. ... tcp sport < 16386 tcp dport >= 32768 masquerade random This patch provided a built-in policy similar to the one described above. 2) Disable register tracking infrastructure in nf_tables. Florian reported two issues: - Existing expressions with no .reduce interface implemented that store data on a register should cancel the tracking. - Register clobbering might be possible when storing data larger than 32 bits on registers. This might lead to generating incorrect ruleset bytecode. These two issues are scheduled to be addressed in the next release cycle. * git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: disable register tracking Revert "netfilter: conntrack: tag conntracks picked up in local out hook" Revert "netfilter: nat: force port remap to prevent shadowing well-known ports" ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2022-03-11vsock: each transport cycles only on its own socketsJiyong Park1-1/+2
When iterating over sockets using vsock_for_each_connected_socket, make sure that a transport filters out sockets that don't belong to the transport. There actually was an issue caused by this; in a nested VM configuration, destroying the nested VM (which often involves closing /dev/vhost-vsock if there were h2g connections to the nested VM) kills not only the h2g connections, but also all existing g2h connections to the (outermost) host, which are totally unrelated. Tested: Executed the following steps on Cuttlefish (Android running on a VM) [1]: (1) Enter into an `adb shell` session - to have a g2h connection inside the VM, (2) open and then close /dev/vhost-vsock by `exec 3< /dev/vhost-vsock && exec 3<&-`, (3) observe that the adb session is not reset. [1] https://android.googlesource.com/device/google/cuttlefish/ Fixes: c0cfa2d8a788 ("vsock: add multi-transports support") Reviewed-by: Stefano Garzarella <[email protected]> Acked-by: Michael S. Tsirkin <[email protected]> Signed-off-by: Jiyong Park <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
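A rough sketch of the idea (the iterator and field names follow the upstream vsock code, but the exact plumbing here is an approximation): vsock_for_each_connected_socket() now receives the calling transport and skips sockets owned by any other transport.

    /* Inside vsock_for_each_connected_socket(transport, fn), roughly: */
    list_for_each_entry(vsk, &vsock_connected_table[i], connected_table) {
            /* Only act on sockets that belong to this transport. */
            if (vsk->transport != transport)
                    continue;
            fn(sk_vsock(vsk));
    }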
2022-03-11Merge branch 'davidh' (fixes from David Howells)Linus Torvalds2-1/+4
Merge misc fixes from David Howells: "A set of patches for watch_queue filter issues noted by Jann. I've added in a cleanup patch from Christophe Jaillet to convert to using formal bitmap specifiers for the note allocation bitmap. Also two filesystem fixes (afs and cachefiles)" * emailed patches from David Howells <[email protected]>: cachefiles: Fix volume coherency attribute afs: Fix potential thrashing in afs writeback watch_queue: Make comment about setting ->defunct more accurate watch_queue: Fix lack of barrier/sync/lock between post and read watch_queue: Free the alloc bitmap when the watch_queue is torn down watch_queue: Fix the alloc bitmap size to reflect notes allocated watch_queue: Use the bitmap API when applicable watch_queue: Fix to always request a pow-of-2 pipe ring size watch_queue: Fix to release page in ->release() watch_queue, pipe: Free watchqueue state after clearing pipe ring watch_queue: Fix filter limit check
2022-03-11cachefiles: Fix volume coherency attributeDavid Howells1-0/+2
A network filesystem may set coherency data on a volume cookie, and if given, cachefiles will store this in an xattr on the directory in the cache corresponding to the volume. The function that sets the xattr just stores the contents of the volume coherency buffer directly into the xattr, with nothing added; the checking function, on the other hand, has a cut'n'paste error whereby it tries to interpret the xattr contents as if it were the xattr on an ordinary file (using the cachefiles_xattr struct). This results in a failure to match the coherency data because the buffer ends up being shifted by 18 bytes. Fix this by defining a structure specifically for the volume xattr and making both the setting and checking functions use it. Since the volume coherency data doesn't work if used, take the opportunity to insert a reserved field for future use, set it to 0 and check that it is 0. Log any mismatch through the appropriate tracepoint. Note that this only affects cifs; 9p, afs, ceph and nfs don't use the volume coherency data at the moment. Fixes: 32e150037dce ("fscache, cachefiles: Store the volume coherency data") Reported-by: Rohith Surabattula <[email protected]> Signed-off-by: David Howells <[email protected]> Reviewed-by: Jeff Layton <[email protected]> cc: Steve French <[email protected]> cc: [email protected] cc: [email protected] Signed-off-by: Linus Torvalds <[email protected]>
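An illustrative sketch of the dedicated layout (field names are assumptions inferred from the description, not copied from the patch): a reserved lead-in that must be zero, followed by the coherency data supplied by the netfs.

    struct cachefiles_vol_xattr {
            __be32  reserved;       /* Reserved, must be 0 */
            __u8    data[];         /* Coherency data supplied by the netfs */
    } __packed;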
2022-03-11watch_queue: Fix filter limit checkDavid Howells1-1/+2
In watch_queue_set_filter(), there are a couple of places where we check that the filter type value does not exceed what the type_filter bitmap can hold. One place calculates the number of bits by: if (tf[i].type >= sizeof(wfilter->type_filter) * 8) which is fine, but the second does: if (tf[i].type >= sizeof(wfilter->type_filter) * BITS_PER_LONG) which is not. This can lead to a couple of out-of-bounds writes due to a too-large type: (1) __set_bit() on wfilter->type_filter (2) Writing more elements in wfilter->filters[] than we allocated. Fix this by just using the proper WATCH_TYPE__NR instead, which is the number of types we actually know about. The bug may cause an oops looking something like: BUG: KASAN: slab-out-of-bounds in watch_queue_set_filter+0x659/0x740 Write of size 4 at addr ffff88800d2c66bc by task watch_queue_oob/611 ... Call Trace: <TASK> dump_stack_lvl+0x45/0x59 print_address_description.constprop.0+0x1f/0x150 ... kasan_report.cold+0x7f/0x11b ... watch_queue_set_filter+0x659/0x740 ... __x64_sys_ioctl+0x127/0x190 do_syscall_64+0x43/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae Allocated by task 611: kasan_save_stack+0x1e/0x40 __kasan_kmalloc+0x81/0xa0 watch_queue_set_filter+0x23a/0x740 __x64_sys_ioctl+0x127/0x190 do_syscall_64+0x43/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae The buggy address belongs to the object at ffff88800d2c66a0 which belongs to the cache kmalloc-32 of size 32 The buggy address is located 28 bytes inside of 32-byte region [ffff88800d2c66a0, ffff88800d2c66c0) Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn <[email protected]> Signed-off-by: David Howells <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
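A sketch of the corrected bound check (simplified from the description above; both the counting loop and the filling loop use the same test): WATCH_TYPE__NR bounds the types we know about, so anything at or above it is skipped before touching the bitmap or the filters array.

    /* Illustrative fragment - tf, i and wfilter come from watch_queue_set_filter(). */
    if (tf[i].type >= WATCH_TYPE__NR)
            continue;

    __set_bit(tf[i].type, wfilter->type_filter);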
2022-03-10Merge tag 'net-5.17-rc8' of ↵Linus Torvalds4-5/+8
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bluetooth, and ipsec. Current release - regressions: - Bluetooth: fix unbalanced unlock in set_device_flags() - Bluetooth: fix not processing all entries on cmd_sync_work, make connect with qualcomm and intel adapters reliable - Revert "xfrm: state and policy should fail if XFRMA_IF_ID 0" - xdp: xdp_mem_allocator can be NULL in trace_mem_connect() - eth: ice: fix race condition and deadlock during interface enslave Current release - new code bugs: - tipc: fix incorrect order of state message data sanity check Previous releases - regressions: - esp: fix possible buffer overflow in ESP transformation - dsa: unlock the rtnl_mutex when dsa_master_setup() fails - phy: meson-gxl: fix interrupt handling in forced mode - smsc95xx: ignore -ENODEV errors when device is unplugged Previous releases - always broken: - xfrm: fix tunnel mode fragmentation behavior - esp: fix inter address family tunneling on GSO - tipc: fix null-deref due to race when enabling bearer - sctp: fix kernel-infoleak for SCTP sockets - eth: macb: fix lost RX packet wakeup race in NAPI receive - eth: intel stop disabling VFs due to PF error responses - eth: bcmgenet: don't claim WOL when its not available" * tag 'net-5.17-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (50 commits) xdp: xdp_mem_allocator can be NULL in trace_mem_connect(). ice: Fix race condition during interface enslave net: phy: meson-gxl: improve link-up behavior net: bcmgenet: Don't claim WOL when its not available net: arc_emac: Fix use after free in arc_mdio_probe() sctp: fix kernel-infoleak for SCTP sockets net: phy: correct spelling error of media in documentation net: phy: DP83822: clear MISR2 register to disable interrupts gianfar: ethtool: Fix refcount leak in gfar_get_ts_info selftests: pmtu.sh: Kill nettest processes launched in subshell. selftests: pmtu.sh: Kill tcpdump processes launched by subshell. NFC: port100: fix use-after-free in port100_send_complete net/mlx5e: SHAMPO, reduce TIR indication net/mlx5e: Lag, Only handle events from highest priority multipath entry net/mlx5: Fix offloading with ESWITCH_IPV4_TTL_MODIFY_ENABLE net/mlx5: Fix a race on command flush flow net/mlx5: Fix size field in bufferx_reg struct ax25: Fix NULL pointer dereference in ax25_kill_by_device net: marvell: prestera: Add missing of_node_put() in prestera_switch_set_base_mac_addr net: ethernet: lpc_eth: Handle error for clk_enable ...
2022-03-10net: phy: correct spelling error of media in documentationColin Foster1-2/+2
The header file incorrectly referenced "median-independant interface" instead of media. Correct this typo. Signed-off-by: Colin Foster <[email protected]> Fixes: 4069a572d423 ("net: phy: Document core PHY structures") Reviewed-by: Russell King (Oracle) <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2022-03-09Merge tag 'xsa396-5.17-tag' of ↵Linus Torvalds1-2/+17
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: "Several Linux PV device frontends are using the grant table interfaces for removing access rights of the backends in ways being subject to race conditions, resulting in potential data leaks, data corruption by malicious backends, and denial of service triggered by malicious backends: - blkfront, netfront, scsifront and the gntalloc driver are testing whether a grant reference is still in use. If this is not the case, they assume that a following removal of the granted access will always succeed, which is not true in case the backend has mapped the granted page between those two operations. As a result the backend can keep access to the memory page of the guest no matter how the page will be used after the frontend I/O has finished. The xenbus driver has a similar problem, as it doesn't check the success of removing the granted access of a shared ring buffer. - blkfront, netfront, scsifront, usbfront, dmabuf, xenbus, 9p, kbdfront, and pvcalls are using a functionality to delay freeing a grant reference until it is no longer in use, but the freeing of the related data page is not synchronized with dropping the granted access. As a result the backend can keep access to the memory page even after it has been freed and then re-used for a different purpose. - netfront will fail a BUG_ON() assertion if it fails to revoke access in the rx path. This will result in a Denial of Service (DoS) situation of the guest which can be triggered by the backend" * tag 'xsa396-5.17-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen/netfront: react properly to failing gnttab_end_foreign_access_ref() xen/gnttab: fix gnttab_end_foreign_access() without page specified xen/pvcalls: use alloc/free_pages_exact() xen/9p: use alloc/free_pages_exact() xen/usb: don't use gnttab_end_foreign_access() in xenhcd_gnttab_done() xen: remove gnttab_query_foreign_access() xen/gntalloc: don't use gnttab_query_foreign_access() xen/scsifront: don't use gnttab_query_foreign_access() for mapped status xen/netfront: don't use gnttab_query_foreign_access() for mapped status xen/blkfront: don't use gnttab_query_foreign_access() for mapped status xen/grant-table: add gnttab_try_end_foreign_access() xen/xenbus: don't let xenbus_grant_ring() remove grants in error case
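A rough sketch of the safer pattern the series moves frontends towards (ref and page are placeholders, and real callers also have to handle deferred freeing): revoke the backend's access first and only free or reuse the page if the revocation actually succeeded.

    /* Illustrative only - not lifted from any one frontend. */
    if (gnttab_try_end_foreign_access(ref)) {
            /* Access revoked: the page is safe to free or reuse. */
            __free_page(page);
    } else {
            /*
             * The backend still has the page mapped; it must not be
             * reused. The series handles this case by deferring the free.
             */
    }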
2022-03-09net/mlx5e: SHAMPO, reduce TIR indicationBen Ben-Ishay1-1/+0
SHAMPO is an RQ / WQ feature. An indication was added to the TIR to enforce suitability between a connected TIR and RQ, but this enforcement does not exist in the current firmware implementation and was redundant in the first place. Fixes: 83439f3c37aa ("net/mlx5e: Add HW-GRO offload") Signed-off-by: Ben Ben-Ishay <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2022-03-09net/mlx5: Fix size field in bufferx_reg structMohammad Kabat1-2/+2
According to the HW spec, the "size" field in the bufferx register should be 16 bits wide. Fixes: e281682bf294 ("net/mlx5_core: HW data structs/types definitions cleanup") Signed-off-by: Mohammad Kabat <[email protected]> Reviewed-by: Moshe Shemesh <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
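For orientation, the register layout in include/linux/mlx5/mlx5_ifc.h looks roughly like the sketch below (the surrounding fields and reserved-bit offsets are recalled, not taken from the patch; the point is the 0x10, i.e. 16-bit, width of "size"):

    struct mlx5_ifc_bufferx_reg_bits {
            u8         reserved_at_0[0x6];
            u8         lossy[0x1];
            u8         epsb[0x1];
            u8         reserved_at_8[0x8];
            u8         size[0x10];          /* 16 bits, per the HW spec */

            u8         xoff_threshold[0x10];
            u8         xon_threshold[0x10];
    };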
2022-03-09Merge branch 'master' of ↵David S. Miller2-0/+4
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2022-03-09 1) Fix IPv6 PMTU discovery for xfrm interfaces. From Lina Wang. 2) Revert failing for policies and states that are configured with XFRMA_IF_ID 0. It broke a user configuration. From Kai Lueke. 3) Fix a possible buffer overflow in the ESP output path. 4) Fix ESP GSO for tunnel and BEET mode on inter address family tunnels. ==================== Signed-off-by: David S. Miller <[email protected]>
2022-03-08Merge tag 'fuse-fixes-5.17-rc8' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fixes from Miklos Szeredi: - Fix an issue with splice on the fuse device - Fix a regression in the fileattr API conversion - Add a small userspace API improvement * tag 'fuse-fixes-5.17-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: fix pipe buffer lifetime for direct_io fuse: move FUSE_SUPER_MAGIC definition to magic.h fuse: fix fileattr op failure
2022-03-08Merge tag 'arm64-spectre-bhb-for-v5.17-2' of ↵Linus Torvalds1-0/+5
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 spectre fixes from James Morse: "ARM64 Spectre-BHB mitigations: - Make EL1 vectors per-cpu - Add mitigation sequences to the EL1 and EL2 vectors on vulnerable CPUs - Implement ARCH_WORKAROUND_3 for KVM guests - Report Vulnerable when unprivileged eBPF is enabled" * tag 'arm64-spectre-bhb-for-v5.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64: proton-pack: Include unprivileged eBPF status in Spectre v2 mitigation reporting arm64: Use the clearbhb instruction in mitigations KVM: arm64: Allow SMCCC_ARCH_WORKAROUND_3 to be discovered and migrated arm64: Mitigate spectre style branch history side channels arm64: proton-pack: Report Spectre-BHB vulnerabilities as part of Spectre-v2 arm64: Add percpu vectors for EL1 arm64: entry: Add macro for reading symbol addresses from the trampoline arm64: entry: Add vectors that have the bhb mitigation sequences arm64: entry: Add non-kpti __bp_harden_el1_vectors for mitigations arm64: entry: Allow the trampoline text to occupy multiple pages arm64: entry: Make the kpti trampoline's kpti sequence optional arm64: entry: Move trampoline macros out of ifdef'd section arm64: entry: Don't assume tramp_vectors is the start of the vectors arm64: entry: Allow tramp_alias to access symbols after the 4K boundary arm64: entry: Move the trampoline data page before the text page arm64: entry: Free up another register on kpti's tramp_exit path arm64: entry: Make the trampoline cleanup optional KVM: arm64: Allow indirect vectors to be used without SPECTRE_V3A arm64: spectre: Rename spectre_v4_patch_fw_mitigation_conduit arm64: entry.S: Add ventry overflow sanity checks