aboutsummaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)AuthorFilesLines
2023-12-29mm/sparsemem: fix race in accessing memory_section->usageCharan Teja Kalla1-3/+11
The below race is observed on a PFN which falls into the device memory region with the system memory configuration where PFN's are such that [ZONE_NORMAL ZONE_DEVICE ZONE_NORMAL]. Since normal zone start and end pfn contains the device memory PFN's as well, the compaction triggered will try on the device memory PFN's too though they end up in NOP(because pfn_to_online_page() returns NULL for ZONE_DEVICE memory sections). When from other core, the section mappings are being removed for the ZONE_DEVICE region, that the PFN in question belongs to, on which compaction is currently being operated is resulting into the kernel crash with CONFIG_SPASEMEM_VMEMAP enabled. The crash logs can be seen at [1]. compact_zone() memunmap_pages ------------- --------------- __pageblock_pfn_to_page ...... (a)pfn_valid(): valid_section()//return true (b)__remove_pages()-> sparse_remove_section()-> section_deactivate(): [Free the array ms->usage and set ms->usage = NULL] pfn_section_valid() [Access ms->usage which is NULL] NOTE: From the above it can be said that the race is reduced to between the pfn_valid()/pfn_section_valid() and the section deactivate with SPASEMEM_VMEMAP enabled. The commit b943f045a9af("mm/sparse: fix kernel crash with pfn_section_valid check") tried to address the same problem by clearing the SECTION_HAS_MEM_MAP with the expectation of valid_section() returns false thus ms->usage is not accessed. Fix this issue by the below steps: a) Clear SECTION_HAS_MEM_MAP before freeing the ->usage. b) RCU protected read side critical section will either return NULL when SECTION_HAS_MEM_MAP is cleared or can successfully access ->usage. c) Free the ->usage with kfree_rcu() and set ms->usage = NULL. No attempt will be made to access ->usage after this as the SECTION_HAS_MEM_MAP is cleared thus valid_section() return false. Thanks to David/Pavan for their inputs on this patch. [1] https://lore.kernel.org/linux-mm/[email protected]/ On Snapdragon SoC, with the mentioned memory configuration of PFN's as [ZONE_NORMAL ZONE_DEVICE ZONE_NORMAL], we are able to see bunch of issues daily while testing on a device farm. For this particular issue below is the log. Though the below log is not directly pointing to the pfn_section_valid(){ ms->usage;}, when we loaded this dump on T32 lauterbach tool, it is pointing. [ 540.578056] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 [ 540.578068] Mem abort info: [ 540.578070] ESR = 0x0000000096000005 [ 540.578073] EC = 0x25: DABT (current EL), IL = 32 bits [ 540.578077] SET = 0, FnV = 0 [ 540.578080] EA = 0, S1PTW = 0 [ 540.578082] FSC = 0x05: level 1 translation fault [ 540.578085] Data abort info: [ 540.578086] ISV = 0, ISS = 0x00000005 [ 540.578088] CM = 0, WnR = 0 [ 540.579431] pstate: 82400005 (Nzcv daif +PAN -UAO +TCO -DIT -SSBSBTYPE=--) [ 540.579436] pc : __pageblock_pfn_to_page+0x6c/0x14c [ 540.579454] lr : compact_zone+0x994/0x1058 [ 540.579460] sp : ffffffc03579b510 [ 540.579463] x29: ffffffc03579b510 x28: 0000000000235800 x27:000000000000000c [ 540.579470] x26: 0000000000235c00 x25: 0000000000000068 x24:ffffffc03579b640 [ 540.579477] x23: 0000000000000001 x22: ffffffc03579b660 x21:0000000000000000 [ 540.579483] x20: 0000000000235bff x19: ffffffdebf7e3940 x18:ffffffdebf66d140 [ 540.579489] x17: 00000000739ba063 x16: 00000000739ba063 x15:00000000009f4bff [ 540.579495] x14: 0000008000000000 x13: 0000000000000000 x12:0000000000000001 [ 540.579501] x11: 0000000000000000 x10: 0000000000000000 x9 :ffffff897d2cd440 [ 540.579507] x8 : 0000000000000000 x7 : 0000000000000000 x6 :ffffffc03579b5b4 [ 540.579512] x5 : 0000000000027f25 x4 : ffffffc03579b5b8 x3 :0000000000000001 [ 540.579518] x2 : ffffffdebf7e3940 x1 : 0000000000235c00 x0 :0000000000235800 [ 540.579524] Call trace: [ 540.579527] __pageblock_pfn_to_page+0x6c/0x14c [ 540.579533] compact_zone+0x994/0x1058 [ 540.579536] try_to_compact_pages+0x128/0x378 [ 540.579540] __alloc_pages_direct_compact+0x80/0x2b0 [ 540.579544] __alloc_pages_slowpath+0x5c0/0xe10 [ 540.579547] __alloc_pages+0x250/0x2d0 [ 540.579550] __iommu_dma_alloc_noncontiguous+0x13c/0x3fc [ 540.579561] iommu_dma_alloc+0xa0/0x320 [ 540.579565] dma_alloc_attrs+0xd4/0x108 [[email protected]: use kfree_rcu() in place of synchronize_rcu(), per David] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: f46edbd1b151 ("mm/sparsemem: add helpers track active portions of a section at boot") Signed-off-by: Charan Teja Kalla <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29mm: remove VM_EXEC requirement for THP eligibilityFangrui Song1-1/+0
Commit e6be37b2e7bd ("mm/huge_memory.c: add missing read-only THP checking in transparent_hugepage_enabled()") introduced the VM_EXEC requirement, which is not strictly needed. lld's default --rosegment option and GNU ld's -z separate-code option (default on Linux/x86 since binutils 2.31) create a read-only PT_LOAD segment without the PF_X flag, which should be eligible for THP. Certain architectures support medium and large code models, where .lrodata may be placed in a separate read-only PT_LOAD segment, which should be eligible for THP as well. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Fangrui Song <[email protected]> Acked-by: Yang Shi <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Song Liu <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29lib/stackdepot: fix comment in include/linux/stackdepot.hAndrey Konovalov1-2/+0
As stack traces can now be evicted from the stack depot, remove the comment saying that they are never removed. Link: https://lkml.kernel.org/r/0ebe712d91f8d302a8947d3c9e9123bc2b1b8440.1703020707.git.andreyknvl@google.com Fixes: 108be8def46e ("lib/stackdepot: allow users to evict stack traces") Signed-off-by: Andrey Konovalov <[email protected]> Reviewed-by: Marco Elver <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Evgenii Stepanov <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29kasan: rename and document kasan_(un)poison_object_dataAndrey Konovalov1-8/+27
Rename kasan_unpoison_object_data to kasan_unpoison_new_object and add a documentation comment. Do the same for kasan_poison_object_data. The new names and the comments should suggest the users that these hooks are intended for internal use by the slab allocator. The following patch will remove non-slab-internal uses of these hooks. No functional changes. [[email protected]: update references to renamed functions in comments] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/eab156ebbd635f9635ef67d1a4271f716994e628.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <[email protected]> Reviewed-by: Marco Elver <[email protected]> Cc: Alexander Lobakin <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Breno Leitao <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Evgenii Stepanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29mempool: introduce mempool_use_prealloc_onlyAndrey Konovalov1-0/+1
Introduce a new mempool_alloc_preallocated API that asks the mempool to only use the elements preallocated during the mempool's creation when allocating and to not attempt allocating new ones from the underlying allocator. This API is required to test the KASAN poisoning/unpoisoning functionality in KASAN tests, but it might be also useful on its own. Link: https://lkml.kernel.org/r/a14d809dbdfd04cc33bcacc632fee2abd6b83c00.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <[email protected]> Cc: Alexander Lobakin <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Breno Leitao <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Evgenii Stepanov <[email protected]> Cc: Marco Elver <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29kasan: save alloc stack traces for mempoolAndrey Konovalov1-3/+4
Update kasan_mempool_unpoison_object to properly poison the redzone and save alloc strack traces for kmalloc and slab pools. As a part of this change, split out and use a unpoison_slab_object helper function from __kasan_slab_alloc. [[email protected]: mark unpoison_slab_object() as static] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/05ad235da8347cfe14d496d01b2aaf074b4f607c.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <[email protected]> Signed-off-by: Nathan Chancellor <[email protected]> Cc: Alexander Lobakin <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Breno Leitao <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Evgenii Stepanov <[email protected]> Cc: Marco Elver <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29kasan: save free stack traces for slab mempoolsAndrey Konovalov1-2/+3
Make kasan_mempool_poison_object save free stack traces for slab and kmalloc mempools when the object is freed into the mempool. Also simplify and rename ____kasan_slab_free to poison_slab_object and do a few other reability changes. Link: https://lkml.kernel.org/r/413a7c7c3344fb56809853339ffaabc9e4905e94.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <[email protected]> Cc: Alexander Lobakin <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Breno Leitao <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Evgenii Stepanov <[email protected]> Cc: Marco Elver <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29kasan: introduce kasan_mempool_unpoison_pagesAndrey Konovalov1-0/+25
Introduce and document a new kasan_mempool_unpoison_pages hook to be used by the mempool code instead of kasan_unpoison_pages. This hook is not functionally different from kasan_unpoison_pages, but using it improves the mempool code readability. Link: https://lkml.kernel.org/r/239bd9af6176f2cc59f5c25893eb36143184daff.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <[email protected]> Cc: Alexander Lobakin <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Breno Leitao <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Evgenii Stepanov <[email protected]> Cc: Marco Elver <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29kasan: introduce kasan_mempool_poison_pagesAndrey Konovalov1-0/+27
Introduce and document a kasan_mempool_poison_pages hook to be used by the mempool code instead of kasan_poison_pages. Compated to kasan_poison_pages, the new hook: 1. For the tag-based modes, skips checking and poisoning allocations that were not tagged due to sampling. 2. Checks for double-free and invalid-free bugs. In the future, kasan_poison_pages can also be updated to handle #2, but this is out-of-scope of this series. Link: https://lkml.kernel.org/r/88dc7340cce28249abf789f6e0c792c317df9ba5.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <[email protected]> Cc: Alexander Lobakin <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Breno Leitao <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Evgenii Stepanov <[email protected]> Cc: Marco Elver <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29kasan: introduce kasan_mempool_unpoison_objectAndrey Konovalov1-0/+31
Introduce and document a kasan_mempool_unpoison_object hook. This hook serves as a replacement for the generic kasan_unpoison_range that the mempool code relies on right now. mempool will be updated to use the new hook in one of the following patches. For now, define the new hook to be identical to kasan_unpoison_range. One of the following patches will update it to add stack trace collection. Link: https://lkml.kernel.org/r/dae25f0e18ed8fd50efe509c5b71a0592de5c18d.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <[email protected]> Cc: Alexander Lobakin <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Breno Leitao <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Evgenii Stepanov <[email protected]> Cc: Marco Elver <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29kasan: add return value for kasan_mempool_poison_objectAndrey Konovalov1-5/+12
Add a return value for kasan_mempool_poison_object that lets the caller know whether the allocation is affected by a double-free or an invalid-free bug. The caller can use this return value to stop operating on the object. Also introduce a check_page_allocation helper function to improve the code readability. Link: https://lkml.kernel.org/r/618af65273875fb9f56954285443279b15f1fcd9.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <[email protected]> Cc: Alexander Lobakin <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Breno Leitao <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Evgenii Stepanov <[email protected]> Cc: Marco Elver <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29kasan: document kasan_mempool_poison_objectAndrey Konovalov1-0/+18
Add documentation comment for kasan_mempool_poison_object. Link: https://lkml.kernel.org/r/af33ba8cabfa1ad731fe23a3f874bfc8d3b7fed4.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <[email protected]> Cc: Alexander Lobakin <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Breno Leitao <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Evgenii Stepanov <[email protected]> Cc: Marco Elver <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29kasan: move kasan_mempool_poison_objectAndrey Konovalov1-8/+8
Move kasan_mempool_poison_object after all slab-related KASAN hooks. This is a preparatory change for the following patches in this series. No functional changes. Link: https://lkml.kernel.org/r/23ea215409f43c13cdf9ecc454501a264c107d67.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <[email protected]> Cc: Alexander Lobakin <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Breno Leitao <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Evgenii Stepanov <[email protected]> Cc: Marco Elver <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29kasan: rename kasan_slab_free_mempool to kasan_mempool_poison_objectAndrey Konovalov1-4/+4
Patch series "kasan: save mempool stack traces". This series updates KASAN to save alloc and free stack traces for secondary-level allocators that cache and reuse allocations internally instead of giving them back to the underlying allocator (e.g. mempool). As a part of this change, introduce and document a set of KASAN hooks: bool kasan_mempool_poison_pages(struct page *page, unsigned int order); void kasan_mempool_unpoison_pages(struct page *page, unsigned int order); bool kasan_mempool_poison_object(void *ptr); void kasan_mempool_unpoison_object(void *ptr, size_t size); and use them in the mempool code. Besides mempool, skbuff and io_uring also cache allocations and already use KASAN hooks to poison those. Their code is updated to use the new mempool hooks. The new hooks save alloc and free stack traces (for normal kmalloc and slab objects; stack traces for large kmalloc objects and page_alloc are not supported by KASAN yet), improve the readability of the users' code, and also allow the users to prevent double-free and invalid-free bugs; see the patches for the details. This patch (of 21): Rename kasan_slab_free_mempool to kasan_mempool_poison_object. kasan_slab_free_mempool is a slightly confusing name: it is unclear whether this function poisons the object when it is freed into mempool or does something when the object is freed from mempool to the underlying allocator. The new name also aligns with other mempool-related KASAN hooks added in the following patches in this series. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/c5618685abb7cdbf9fb4897f565e7759f601da84.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <[email protected]> Cc: Alexander Lobakin <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Breno Leitao <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Evgenii Stepanov <[email protected]> Cc: Marco Elver <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29fs: remove the bh_end_io argument from __block_write_full_folioMatthew Wilcox (Oracle)1-3/+1
All callers are passing end_buffer_async_write as this argument, so we can hardcode references to it within __block_write_full_folio(). That lets us make end_buffer_async_write() static. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Jens Axboe <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29fs: convert block_write_full_page to block_write_full_folioMatthew Wilcox (Oracle)1-2/+2
Convert the function to be compatible with writepage_t so that it can be passed to write_cache_pages() by blkdev. This removes a call to compound_head(). We can also remove the function export as both callers are built-in. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29fs: remove clean_page_buffers()Matthew Wilcox (Oracle)1-1/+0
Patch series "Clean up the writeback paths". Most of these patches verge on the trivial, converting filesystems that just use block_write_full_page() to use mpage_writepages(). But as we saw with Christoph's earlier patchset, there can be some "interesting" gotchas, and I clearly haven't tested the majority of filesystems I've touched here. Patches 3 & 4 get rid of a lot of stack usage on architectures with larger page sizes; 1024 bytes on 64-bit systems with 64KiB pages. It starts to open the door to larger folio sizes on all architectures, but it's certainly not enough yet. Patch 14 is kind of trivial, but it's nice to get that simplification in. This patch (of 14): This function has been unused since the removal of bdev_write_page(). Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29mm: remove page_swap_info()Matthew Wilcox (Oracle)1-2/+1
It's more efficient to get the swap_info_struct by calling swp_swap_info() directly. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29mm: convert swap_page_sector() to swap_folio_sector()Matthew Wilcox (Oracle)1-1/+1
All callers have a folio, so pass it in. Saves a couple of calls to compound_head(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29mm: return the folio from __read_swap_cache_async()Matthew Wilcox (Oracle)1-2/+2
Patch series "More swap folio conversions". These all seem like fairly straightforward conversions to me. A lot of compound_head() calls get removed. And page_swap_info(), which is nice. This patch (of 13): Move the folio->page conversion into the callers that actually want that. Most of the callers are happier with the folio anyway. If the page_allocated boolean is set, the folio allocated is of order-0, so it is safe to pass the page directly to swap_readpage(). Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29mm/zswap: change per-cpu mutex and buffer to per-acomp_ctxChengming Zhou1-1/+0
First of all, we need to rename acomp_ctx->dstmem field to buffer, since we are now using for purposes other than compression. Then we change per-cpu mutex and buffer to per-acomp_ctx, since them belong to the acomp_ctx and are necessary parts when used in the compress/decompress contexts. So we can remove the old per-cpu mutex and dstmem. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Chengming Zhou <[email protected]> Acked-by: Chris Li <[email protected]> (Google) Reviewed-by: Nhat Pham <[email protected]> Cc: Barry Song <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Vitaly Wool <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29mm: remove page_add_new_anon_rmap and lru_cache_add_inactive_or_unevictableMatthew Wilcox (Oracle)2-5/+0
All callers have now been converted to folio_add_new_anon_rmap() and folio_add_lru_vma() so we can remove the wrapper. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29mm: convert ksm_might_need_to_copy() to work on foliosMatthew Wilcox (Oracle)1-3/+3
Patch series "Finish two folio conversions". Most callers of page_add_new_anon_rmap() and lru_cache_add_inactive_or_unevictable() have been converted to their folio equivalents, but there are still a few stragglers. There's a bit of preparatory work in ksm and unuse_pte(), but after that it's pretty mechanical. This patch (of 9): Accept a folio as an argument and return a folio result. Removes a call to compound_head() in do_swap_page(), and prevents folio & page from getting out of sync in unuse_pte(). Reviewed-by: David Hildenbrand <[email protected]> [[email protected]: fix smatch warning] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: only adjust the page if the folio changed] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29userfaultfd: UFFDIO_MOVE uABIAndrea Arcangeli2-0/+16
Implement the uABI of UFFDIO_MOVE ioctl. UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application needs pages to be allocated [1]. However, with UFFDIO_MOVE, if pages are available (in userspace) for recycling, as is usually the case in heap compaction algorithms, then we can avoid the page allocation and memcpy (done by UFFDIO_COPY). Also, since the pages are recycled in the userspace, we avoid the need to release (via madvise) the pages back to the kernel [2]. We see over 40% reduction (on a Google pixel 6 device) in the compacting thread's completion time by using UFFDIO_MOVE vs. UFFDIO_COPY. This was measured using a benchmark that emulates a heap compaction implementation using userfaultfd (to allow concurrent accesses by application threads). More details of the usecase are explained in [2]. Furthermore, UFFDIO_MOVE enables moving swapped-out pages without touching them within the same vma. Today, it can only be done by mremap, however it forces splitting the vma. [1] https://lore.kernel.org/all/[email protected]/ [2] https://lore.kernel.org/linux-mm/CA+EESO4uO84SSnBhArH4HvLNhaUQ5nZKNKXqxRCyjniNVjp0Aw@mail.gmail.com/ Update for the ioctl_userfaultfd(2) manpage: UFFDIO_MOVE (Since Linux xxx) Move a continuous memory chunk into the userfault registered range and optionally wake up the blocked thread. The source and destination addresses and the number of bytes to move are specified by the src, dst, and len fields of the uffdio_move structure pointed to by argp: struct uffdio_move { __u64 dst; /* Destination of move */ __u64 src; /* Source of move */ __u64 len; /* Number of bytes to move */ __u64 mode; /* Flags controlling behavior of move */ __s64 move; /* Number of bytes moved, or negated error */ }; The following value may be bitwise ORed in mode to change the behavior of the UFFDIO_MOVE operation: UFFDIO_MOVE_MODE_DONTWAKE Do not wake up the thread that waits for page-fault resolution UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES Allow holes in the source virtual range that is being moved. When not specified, the holes will result in ENOENT error. When specified, the holes will be accounted as successfully moved memory. This is mostly useful to move hugepage aligned virtual regions without knowing if there are transparent hugepages in the regions or not, but preventing the risk of having to split the hugepage during the operation. The move field is used by the kernel to return the number of bytes that was actually moved, or an error (a negated errno- style value). If the value returned in move doesn't match the value that was specified in len, the operation fails with the error EAGAIN. The move field is output-only; it is not read by the UFFDIO_MOVE operation. The operation may fail for various reasons. Usually, remapping of pages that are not exclusive to the given process fail; once KSM might deduplicate pages or fork() COW-shares pages during fork() with child processes, they are no longer exclusive. Further, the kernel might only perform lightweight checks for detecting whether the pages are exclusive, and return -EBUSY in case that check fails. To make the operation more likely to succeed, KSM should be disabled, fork() should be avoided or MADV_DONTFORK should be configured for the source VMA before fork(). This ioctl(2) operation returns 0 on success. In this case, the entire area was moved. On error, -1 is returned and errno is set to indicate the error. Possible errors include: EAGAIN The number of bytes moved (i.e., the value returned in the move field) does not equal the value that was specified in the len field. EINVAL Either dst or len was not a multiple of the system page size, or the range specified by src and len or dst and len was invalid. EINVAL An invalid bit was specified in the mode field. ENOENT The source virtual memory range has unmapped holes and UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES is not set. EEXIST The destination virtual memory range is fully or partially mapped. EBUSY The pages in the source virtual memory range are either pinned or not exclusive to the process. The kernel might only perform lightweight checks for detecting whether the pages are exclusive. To make the operation more likely to succeed, KSM should be disabled, fork() should be avoided or MADV_DONTFORK should be configured for the source virtual memory area before fork(). ENOMEM Allocating memory needed for the operation failed. ESRCH The target process has exited at the time of a UFFDIO_MOVE operation. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Andrea Arcangeli <[email protected]> Signed-off-by: Suren Baghdasaryan <[email protected]> Cc: Al Viro <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Brian Geffon <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jann Horn <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: Nicolas Geoffray <[email protected]> Cc: Peter Xu <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Shuah Khan <[email protected]> Cc: ZhangPeng <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29Merge tag 'block-6.7-2023-12-29' of git://git.kernel.dk/linuxLinus Torvalds1-1/+1
Pull block fixes from Jens Axboe: "Fix for a badly numbered flag, and a regression fix for the badblocks updates from this merge window" * tag 'block-6.7-2023-12-29' of git://git.kernel.dk/linux: block: renumber QUEUE_FLAG_HW_WC badblocks: avoid checking invalid range in badblocks_check()
2023-12-29thermal/sysfs: Update governors when the 'weight' has changedLukasz Luba1-0/+1
Support governors update when the thermal instance's weight has changed. This allows to adjust internal state for the governor. Signed-off-by: Lukasz Luba <[email protected]> [ rjw: Add two empty code lines aroung the locking ] Signed-off-by: Rafael J. Wysocki <[email protected]>
2023-12-29thermal: core: Add governor callback for thermal zone changeLukasz Luba1-0/+6
Add a new callback to the struct thermal_governor. It can be used for updating governors when there is a change in the thermal zone internals, e.g. thermal cooling device is bind to the thermal zone. That makes possible to move some heavy operations like memory allocations related to the number of cooling instances out of the throttle() callback. Both callback code paths (throttle() and update_tz()) are protected with the same thermal zone lock, which guaranties the consistency. Signed-off-by: Lukasz Luba <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2023-12-29ethtool: reformat kerneldoc for struct ethtool_fec_statsJonathan Corbet1-2/+4
The kerneldoc comment for struct ethtool_fec_stats attempts to describe the "total" and "lanes" fields of the ethtool_fec_stat substructure in a way leading to these warnings: ./include/linux/ethtool.h:424: warning: Excess struct member 'lane' description in 'ethtool_fec_stats' ./include/linux/ethtool.h:424: warning: Excess struct member 'total' description in 'ethtool_fec_stats' Reformat the comment to retain the information while eliminating the warnings. Signed-off-by: Jonathan Corbet <[email protected]> Reviewed-by: Randy Dunlap <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-12-28Merge tag 'kbuild-fixes-v6.7-2' of ↵Linus Torvalds1-1/+5
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild fixes from Masahiro Yamada: - Revive proper alignment for the ksymtab and kcrctab sections - Fix gen_compile_commands.py tool to resolve symbolic links - Fix symbolic links to installed debug VDSO files - Update MAINTAINERS * tag 'kbuild-fixes-v6.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: linux/export: Ensure natural alignment of kcrctab array kbuild: fix build ID symlinks to installed debug VDSO files gen_compile_commands.py: fix path resolve with symlinks in it MAINTAINERS: Add scripts/clang-tools to Kbuild section linux/export: Fix alignment for 64-bit ksymtab entries
2023-12-29linux/export: Ensure natural alignment of kcrctab arrayHelge Deller1-0/+1
The ___kcrctab section holds an array of 32-bit CRC values. Add a .balign 4 to tell the linker the correct memory alignment. Fixes: f3304ecd7f06 ("linux/export: use inline assembler to populate symbol CRCs") Signed-off-by: Helge Deller <[email protected]> Signed-off-by: Masahiro Yamada <[email protected]>
2023-12-28thermal: core: Fix thermal zone suspend-resume synchronizationRafael J. Wysocki1-0/+2
There are 3 synchronization issues with thermal zone suspend-resume during system-wide transitions: 1. The resume code runs in a PM notifier which is invoked after user space has been thawed, so it can run concurrently with user space which can trigger a thermal zone device removal. If that happens, the thermal zone resume code may use a stale pointer to the next list element and crash, because it does not hold thermal_list_lock while walking thermal_tz_list. 2. The thermal zone resume code calls thermal_zone_device_init() outside the zone lock, so user space or an update triggered by the platform firmware may see an inconsistent state of a thermal zone leading to unexpected behavior. 3. Clearing the in_suspend global variable in thermal_pm_notify() allows __thermal_zone_device_update() to continue for all thermal zones and it may as well run before the thermal_tz_list walk (or at any point during the list walk for that matter) and attempt to operate on a thermal zone that has not been resumed yet. It may also race destructively with thermal_zone_device_init(). To address these issues, add thermal_list_lock locking to thermal_pm_notify(), especially arount the thermal_tz_list, make it call thermal_zone_device_init() back-to-back with __thermal_zone_device_update() under the zone lock and replace in_suspend with per-zone bool "suspend" indicators set and unset under the given zone's lock. Link: https://lore.kernel.org/linux-pm/[email protected]/ Reported-by: Bo Ye <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2023-12-28sysctl: remove struct ctl_pathThomas Weißschuh1-5/+0
All usages of this struct have been removed from the kernel tree. The struct is still referenced by scripts/check-sysctl-docs but that script is broken anyways as it only supports the register_sysctl_paths() API and not the currently used register_sysctl() one. Fixes: 0199849acd07 ("sysctl: remove register_sysctl_paths()") Signed-off-by: Thomas Weißschuh <[email protected]> Reviewed-by: Joel Granados <[email protected]> Signed-off-by: Luis Chamberlain <[email protected]>
2023-12-28sysctl: delete unused define SYSCTL_PERM_EMPTY_DIRThomas Weißschuh1-2/+0
It seems it was never used. Fixes: 2f2665c13af4 ("sysctl: replace child with an enumeration") Signed-off-by: Thomas Weißschuh <[email protected]> Signed-off-by: Luis Chamberlain <[email protected]>
2023-12-28fs: fix __sb_write_started() kerneldoc formattingVegard Nossum1-3/+3
When running 'make htmldocs', I see the following warning: Documentation/filesystems/api-summary:14: ./include/linux/fs.h:1659: WARNING: Definition list ends without a blank line; unexpected unindent. The official guidance [1] seems to be to use lists, which will prevent both the "unexpected unindent" warning as well as ensure that each line is formatted on a separate line in the HTML output instead of being all considered a single paragraph. [1]: https://docs.kernel.org/doc-guide/kernel-doc.html#return-values Fixes: 8802e580ee64 ("fs: create __sb_write_started() helper") Cc: Amir Goldstein <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Jan Kara <[email protected]> Signed-off-by: Vegard Nossum <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
2023-12-28netfs: Optimise away reads above the point at which there can be no dataDavid Howells1-3/+18
Track the file position above which the server is not expected to have any data (the "zero point") and preemptively assume that we can satisfy requests by filling them with zeroes locally rather than attempting to download them if they're over that line - even if we've written data back to the server. Assume that any data that was written back above that position is held in the local cache. Note that we have to split requests that straddle the line. Make use of this to optimise away some reads from the server. We need to set the zero point in the following circumstances: (1) When we see an extant remote inode and have no cache for it, we set the zero_point to i_size. (2) On local inode creation, we set zero_point to 0. (3) On local truncation down, we reduce zero_point to the new i_size if the new i_size is lower. (4) On local truncation up, we don't change zero_point. (5) On local modification, we don't change zero_point. (6) On remote invalidation, we set zero_point to the new i_size. (7) If stored data is discarded from the pagecache or culled from fscache, we must set zero_point above that if the data also got written to the server. (8) If dirty data is written back to the server, but not fscache, we must set zero_point above that. (9) If a direct I/O write is made, set zero_point above that. Assuming the above, any read from the server at or above the zero_point position will return all zeroes. The zero_point value can be stored in the cache, provided the above rules are applied to it by any code that culls part of the local cache. Signed-off-by: David Howells <[email protected]> cc: Jeff Layton <[email protected]> cc: [email protected] cc: [email protected] cc: [email protected]
2023-12-28netfs: Implement a write-through caching optionDavid Howells1-0/+2
Provide a flag whereby a filesystem may request that cifs_perform_write() perform write-through caching. This involves putting pages directly into writeback rather than dirty and attaching them to a write operation as we go. Further, the writes being made are limited to the byte range being written rather than whole folios being written. This can be used by cifs, for example, to deal with strict byte-range locking. This can't be used with content encryption as that may require expansion of the write RPC beyond the write being made. This doesn't affect writes via mmap - those are written back in the normal way; similarly failed writethrough writes are marked dirty and left to writeback to retry. Another option would be to simply invalidate them, but the contents can be simultaneously accessed by read() and through mmap. Signed-off-by: David Howells <[email protected]> cc: Jeff Layton <[email protected]> cc: [email protected] cc: [email protected] cc: [email protected]
2023-12-28netfs: Provide a launder_folio implementationDavid Howells1-0/+2
Provide a launder_folio implementation for netfslib. Signed-off-by: David Howells <[email protected]> Reviewed-by: Jeff Layton <[email protected]> cc: [email protected] cc: [email protected] cc: [email protected]
2023-12-28netfs: Provide a writepages implementationDavid Howells1-0/+2
Provide an implementation of writepages for network filesystems to delegate to. Signed-off-by: David Howells <[email protected]> Reviewed-by: Jeff Layton <[email protected]> cc: [email protected] cc: [email protected] cc: [email protected]
2023-12-28netfs, cachefiles: Pass upper bound length to allow expansionDavid Howells1-2/+3
Make netfslib pass the maximum length to the ->prepare_write() op to tell the cache how much it can expand the length of a write to. This allows a write to the server at the end of a file to be limited to a few bytes whilst writing an entire block to the cache (something required by direct I/O). Signed-off-by: David Howells <[email protected]> Reviewed-by: Jeff Layton <[email protected]> cc: [email protected] cc: [email protected] cc: [email protected]
2023-12-28netfs: Provide netfs_file_read_iter()David Howells1-0/+2
Provide a top-level-ish function that can be pointed to directly by ->read_iter file op. Signed-off-by: David Howells <[email protected]> Reviewed-by: Jeff Layton <[email protected]> cc: [email protected] cc: [email protected] cc: [email protected]
2023-12-28netfs: Allow buffered shared-writeable mmap through netfs_page_mkwrite()David Howells1-0/+4
Provide an entry point to delegate a filesystem's ->page_mkwrite() to. This checks for conflicting writes, then attached any netfs-specific group marking (e.g. ceph snap) to the page to be considered dirty. Signed-off-by: David Howells <[email protected]> Reviewed-by: Jeff Layton <[email protected]> cc: [email protected] cc: [email protected] cc: [email protected]
2023-12-28netfs: Implement buffered write APIDavid Howells1-0/+3
Institute a netfs write helper, netfs_file_write_iter(), to be pointed at by the network filesystem ->write_iter() call. Make it handled buffered writes by calling the previously defined netfs_perform_write() to copy the source data into the pagecache. Signed-off-by: David Howells <[email protected]> Reviewed-by: Jeff Layton <[email protected]> cc: [email protected] cc: [email protected] cc: [email protected]
2023-12-28netfs: Implement unbuffered/DIO write supportDavid Howells1-0/+4
Implement support for unbuffered writes and direct I/O writes. If the write is misaligned with respect to the fscrypt block size, then RMW cycles are performed if necessary. DIO writes are a special case of unbuffered writes with extra restriction imposed, such as block size alignment requirements. Also provide a field that can tell the code to add some extra space onto the bounce buffer for use by the filesystem in the case of a content-encrypted file. Signed-off-by: David Howells <[email protected]> Reviewed-by: Jeff Layton <[email protected]> cc: [email protected] cc: [email protected] cc: [email protected]
2023-12-28netfs: Implement unbuffered/DIO read supportDavid Howells1-0/+9
Implement support for unbuffered and DIO reads in the netfs library, utilising the existing read helper code to do block splitting and individual queuing. The code also handles extraction of the destination buffer from the supplied iterator, allowing async unbuffered reads to take place. The read will be split up according to the rsize setting and, if supplied, the ->clamp_length() method. Note that the next subrequest will be issued as soon as issue_op returns, without waiting for previous ones to finish. The network filesystem needs to pause or handle queuing them if it doesn't want to fire them all at the server simultaneously. Once all the subrequests have finished, the state will be assessed and the amount of data to be indicated as having being obtained will be determined. As the subrequests may finish in any order, if an intermediate subrequest is short, any further subrequests may be copied into the buffer and then abandoned. In the future, this will also take care of doing an unbuffered read from encrypted content, with the decryption being done by the library. Signed-off-by: David Howells <[email protected]> cc: Jeff Layton <[email protected]> cc: [email protected] cc: [email protected] cc: [email protected]
2023-12-28netfs: Provide func to copy data to pagecache for buffered writeDavid Howells1-0/+5
Provide a netfs write helper, netfs_perform_write() to buffer data to be written in the pagecache and mark the modified folios dirty. It will perform "streaming writes" for folios that aren't currently resident, if possible, storing data in partially modified folios that are marked dirty, but not uptodate. It will also tag pages as belonging to fs-specific write groups if so directed by the filesystem. This is derived from generic_perform_write(), but doesn't use ->write_begin() and ->write_end(), having that logic rolled in instead. Signed-off-by: David Howells <[email protected]> cc: Jeff Layton <[email protected]> cc: [email protected] cc: [email protected] cc: [email protected]
2023-12-28netfs: Dispatch write requests to process a writeback sliceDavid Howells1-0/+13
Dispatch one or more write reqeusts to process a writeback slice, where a slice is tailored more to logical block divisions within the file (such as crypto blocks, an object layout or cache granules) than the protocol RPC maximum capacity. The dispatch doesn't happen until throttling allows, at which point the entire writeback slice is processed and queued. A slice may be written to multiple destinations (one or more servers and the local cache) and the writes to each destination might be split up along different lines. The writeback slice holds the required folios pinned. An iov_iter is provided in netfs_write_request that describes the buffer to be used. This may be part of the pagecache, may have auxiliary padding pages attached or may be a bounce buffer resulting from crypto or compression. Consequently, the filesystem must not twiddle the folio markings directly. The following API is available to the filesystem: (1) The ->create_write_requests() method is called to ask the filesystem to create the requests it needs. This is passed the writeback slice to be processed. (2) The filesystem should then call netfs_create_write_request() to create the requests it needs. (3) Once a request is initialised, netfs_queue_write_request() can be called to dispatch it asynchronously, if not completed immediately. (4) netfs_write_request_completed() should be called to note the completion of a request. (5) netfs_get_write_request() and netfs_put_write_request() are provided to refcount a request. These take constants from the netfs_wreq_trace enum for logging into ftrace. (6) The ->free_write_request is method is called to ask the filesystem to clean up a request. Signed-off-by: David Howells <[email protected]> Reviewed-by: Jeff Layton <[email protected]> cc: [email protected] cc: [email protected] cc: [email protected]
2023-12-28netfs: Prep to use folio->private for write grouping and streaming writeDavid Howells1-0/+41
Prepare to use folio->private to hold information write grouping and streaming write. These are implemented in the same commit as they both make use of folio->private and will be both checked at the same time in several places. "Write grouping" involves ordering the writeback of groups of writes, such as is needed for ceph snaps. A group is represented by a filesystem-supplied object which must contain a netfs_group struct. This contains just a refcount and a pointer to a destructor. "Streaming write" is the storage of data in folios that are marked dirty, but not uptodate, to avoid unnecessary reads of data. This is represented by a netfs_folio struct. This contains the offset and length of the modified region plus the otherwise displaced write grouping pointer. The way folio->private is multiplexed is: (1) If private is NULL then neither is in operation on a dirty folio. (2) If private is set, with bit 0 clear, then this points to a group. (3) If private is set, with bit 0 set, then this points to a netfs_folio struct (with bit 0 AND'ed out). Signed-off-by: David Howells <[email protected]> Reviewed-by: Jeff Layton <[email protected]> cc: [email protected] cc: [email protected] cc: [email protected]
2023-12-28netfs: Add a hook to allow tell the netfs to update its i_sizeDavid Howells1-0/+4
Add a hook for netfslib's write helpers to call to tell the network filesystem that it should update its i_size. Signed-off-by: David Howells <[email protected]> Reviewed-by: Jeff Layton <[email protected]> cc: [email protected] cc: [email protected] cc: [email protected]
2023-12-28netfs: Extend the netfs_io_*request structs to handle writesDavid Howells1-1/+14
Modify the netfs_io_request struct to act as a point around which writes can be coordinated. It represents and pins a range of pages that need writing and a list of regions of dirty data in that range of pages. If RMW is required, the original data can be downloaded into the bounce buffer, decrypted if necessary, the modifications made, then the modified data can be reencrypted/recompressed and sent back to the server. Signed-off-by: David Howells <[email protected]> Reviewed-by: Jeff Layton <[email protected]> cc: [email protected] cc: [email protected] cc: [email protected]
2023-12-28netfs: Limit subrequest by size or number of segmentsDavid Howells1-0/+1
Limit a subrequest to a maximum size and/or a maximum number of contiguous physical regions. This permits, for instance, an subreq's iterator to be limited to the number of DMA'able segments that a large RDMA request can handle. Signed-off-by: David Howells <[email protected]> Reviewed-by: Jeff Layton <[email protected]> cc: [email protected] cc: [email protected] cc: [email protected]