2022-08-08  get rid of non-advancing variants  (Al Viro; 2 files, -31/+20)
mechanical change; will be further massaged in subsequent commits Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Al Viro <[email protected]>
2022-08-08  ceph: switch the last caller of iov_iter_get_pages_alloc()  (Al Viro; 1 file, -1/+1)
here nothing even looks at the iov_iter after the call, so we couldn't care less whether it advances or not. Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Al Viro <[email protected]>
2022-08-08  9p: convert to advancing variant of iov_iter_get_pages_alloc()  (Al Viro; 3 files, -19/+26)
that one is somewhat clumsier than usual and needs serious testing. Signed-off-by: Al Viro <[email protected]>
2022-08-08  af_alg_make_sg(): switch to advancing variant of iov_iter_get_pages()  (Al Viro; 2 files, -4/+4)
... and adjust the callers Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Al Viro <[email protected]>
2022-08-08  iter_to_pipe(): switch to advancing variant of iov_iter_get_pages()  (Al Viro; 1 file, -23/+24)
... and untangle the cleanup on failure to add into pipe. Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Al Viro <[email protected]>
2022-08-08  block: convert to advancing variants of iov_iter_get_pages{,_alloc}()  (Al Viro; 2 files, -14/+18)
... doing revert if we end up not using some pages Signed-off-by: Al Viro <[email protected]>
2022-08-08  iov_iter: advancing variants of iov_iter_get_pages{,_alloc}()  (Al Viro; 13 files, -30/+34)
Most of the users immediately follow successful iov_iter_get_pages() with advancing by the amount it had returned. Provide inline wrappers doing that, convert trivial open-coded uses of those. BTW, iov_iter_get_pages() never returns more than it had been asked to; such checks in cifs ought to be removed someday... Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Al Viro <[email protected]>
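A minimal sketch of such a wrapper (hedged: mainline named the advancing variants iov_iter_get_pages2()/iov_iter_get_pages_alloc2(); treat the body as illustrative):

	static inline ssize_t iov_iter_get_pages2(struct iov_iter *i,
			struct page **pages, size_t maxsize,
			unsigned int maxpages, size_t *start)
	{
		ssize_t res = iov_iter_get_pages(i, pages, maxsize, maxpages, start);

		/* on success, advance past what we have just grabbed */
		if (res > 0)
			iov_iter_advance(i, res);
		return res;
	}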
2022-08-08  iov_iter: saner helper for page array allocation  (Al Viro; 1 file, -45/+32)
All call sites of get_pages_array() are essentially identical now. Replace with a common helper... Returns the number of slots available in the resulting array, or 0 on OOM; it's up to the caller to make sure it doesn't ask for a zero-entry array (i.e. neither maxpages nor size are allowed to be zero). Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Al Viro <[email protected]>
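A hedged sketch of what such a helper can look like, following the description above (mainline called it want_pages_array(); details illustrative):

	static int want_pages_array(struct page ***res, size_t size,
				    size_t start, unsigned int maxpages)
	{
		unsigned int count = DIV_ROUND_UP(size + start, PAGE_SIZE);

		if (count > maxpages)
			count = maxpages;
		WARN_ON(!count);	/* caller must not ask for a zero-entry array */
		if (!*res) {
			*res = kvmalloc_array(count, sizeof(struct page *),
					      GFP_KERNEL);
			if (!*res)
				return 0;	/* OOM */
		}
		return count;
	}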
2022-08-08  fold __pipe_get_pages() into pipe_get_pages()  (Al Viro; 1 file, -37/+38)
... and don't mangle maxsize there - turn the loop into a counting one instead. Easier to see that we won't run out of array that way. Note that the special treatment of the partial buffer in that thing is an artifact of the non-advancing semantics of iov_iter_get_pages() - if not for that, it would be append_pipe(), same as the body of the loop that follows it. IOW, once we make iov_iter_get_pages() advancing, the whole thing will turn into:
	calculate how many pages we want
	allocate an array (if needed)
	call append_pipe() that many times.
Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Al Viro <[email protected]>
2022-08-08  ITER_XARRAY: don't open-code DIV_ROUND_UP()  (Al Viro; 1 file, -9/+1)
Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Al Viro <[email protected]>
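The shape of the cleanup, as an illustration (variable names are guesses, not the actual diff):

	/* before: open-coded rounding up to a number of pages */
	count = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE;

	/* after */
	count = DIV_ROUND_UP(size + offset, PAGE_SIZE);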
2022-08-08  unify the rest of iov_iter_get_pages()/iov_iter_get_pages_alloc() guts  (Al Viro; 1 file, -59/+27)
same as for pipes and xarrays; after that iov_iter_get_pages() becomes a wrapper for __iov_iter_get_pages_alloc(). Signed-off-by: Al Viro <[email protected]>
2022-08-08  unify xarray_get_pages() and xarray_get_pages_alloc()  (Al Viro; 1 file, -39/+10)
same as for pipes Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Al Viro <[email protected]>
2022-08-08  unify pipe_get_pages() and pipe_get_pages_alloc()  (Al Viro; 1 file, -32/+17)
The differences between those two are:
  * pipe_get_pages() gets a non-NULL struct page ** value pointing to a preallocated array, plus the array size.
  * pipe_get_pages_alloc() gets the address of a struct page ** variable that contains NULL, allocates the array and (on success) stores its address in that variable.
Not hard to combine - always pass struct page ***, have the previous pipe_get_pages_alloc() caller pass ~0U as the cap for array size. Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Al Viro <[email protected]>
2022-08-08  iov_iter_get_pages(): sanity-check arguments  (Al Viro; 1 file, -7/+2)
zero maxpages is bogus, but best treated as "just return 0"; NULL pages, OTOH, should be treated as a hard bug. get rid of now completely useless checks in xarray_get_pages{,_alloc}(). Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Al Viro <[email protected]>
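Illustrative shape of those checks (hedged sketch, not the verbatim diff):

	ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
				   size_t maxsize, unsigned int maxpages,
				   size_t *start)
	{
		if (!maxpages)		/* bogus, but harmless */
			return 0;
		BUG_ON(!pages);		/* NULL pages is a hard bug */
		/* ... actual page extraction ... */
	}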
2022-08-08  iov_iter_get_pages_alloc(): lift freeing pages array on failure exits into wrapper  (Al Viro; 1 file, -16/+22)
Incidentally, ITER_XARRAY did *not* free the sucker in the case when iter_xarray_populate_pages() returned 0... Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Al Viro <[email protected]>
2022-08-08  ITER_PIPE: fold data_start() and pipe_space_for_user() together  (Al Viro; 2 files, -45/+19)
All their callers are next to each other; all of them want the total amount of pages and, possibly, the offset in the partial final buffer. Combine into a new helper (pipe_npages()), fix the bogosity in pipe_space_for_user(), while we are at it. Signed-off-by: Al Viro <[email protected]>
2022-08-08  ITER_PIPE: cache the type of last buffer  (Al Viro; 2 files, -40/+42)
We often need to find whether the last buffer is anon or not, and currently it's rather clumsy:
	check if ->iov_offset is non-zero (i.e. that the pipe is not empty)
	if so, get the corresponding pipe_buffer and check its ->ops
	if it's &default_pipe_buf_ops, we have an anon buffer.
Let's replace the use of ->iov_offset (which is nowhere near similar to its role for other flavours) with a signed field (->last_offset), with the following rules:
	empty, no buffers occupied:		0
	anon, with bytes up to N-1 filled:	N
	zero-copy, with bytes up to N-1 filled:	-N
That way abs(i->last_offset) is equal to what used to be in i->iov_offset, and empty vs. anon vs. zero-copy can be distinguished by the sign of i->last_offset. Checks for "should we extend the last buffer or should we start a new one?" become easier to follow that way.

Note that most of the operations can only be done in a sane state - i.e. when the pipe has nothing past the current position of the iterator. About the only thing that could be done outside of that state is iov_iter_advance(), which transitions to the sane state by truncating the pipe. There are only two cases where we leave the sane state:
 1) iov_iter_get_pages()/iov_iter_get_pages_alloc(). Will be dealt with later, when we make get_pages advancing - the callers are actually happier that way.
 2) iov_iter copied, then something is put into the copy. Since they share the underlying pipe, the original gets behind. When we decide that we are done with the copy (the original is not usable until then) we advance the original. direct_io used to be done that way; nowadays it operates on the original and we do iov_iter_revert() to discard the excessive data. At the moment there's nothing in the kernel that could do that to ITER_PIPE iterators, so this reason for insane state is theoretical right now.
Signed-off-by: Al Viro <[email protected]>
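To illustrate the encoding (the helper names below are invented for illustration; only ->last_offset itself comes from the commit):

	/*
	 * i->last_offset:
	 *	 0	pipe empty, no buffers occupied
	 *	 N > 0	last buffer is anon, bytes 0..N-1 filled
	 *	-N < 0	last buffer is zero-copy, bytes 0..N-1 filled
	 */
	static inline unsigned int offset_in_last(const struct iov_iter *i)
	{
		return abs(i->last_offset);	/* the old i->iov_offset value */
	}

	static inline bool last_is_anon(const struct iov_iter *i)
	{
		return i->last_offset > 0;	/* sign tells anon from zero-copy */
	}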
2022-08-08  ITER_PIPE: clean iov_iter_revert()  (Al Viro; 1 file, -46/+14)
Fold pipe_truncate() into it, clean up. We can release buffers in the same loop where we walk backwards to the iterator beginning looking for the place where the new position will be. Signed-off-by: Al Viro <[email protected]>
2022-08-08  ITER_PIPE: clean pipe_advance() up  (Al Viro; 1 file, -17/+17)
instead of setting ->iov_offset for new position and calling pipe_truncate() to adjust ->len of the last buffer and discard everything after it, adjust ->len at the same time we set ->iov_offset and use pipe_discard_from() to deal with buffers past that. Signed-off-by: Al Viro <[email protected]>
2022-08-08  ITER_PIPE: lose iter_head argument of __pipe_get_pages()  (Al Viro; 1 file, -4/+3)
it's only used to get to the partial buffer we can add to, and that's always the last one, i.e. pipe->head - 1. Signed-off-by: Al Viro <[email protected]>
2022-08-08  ITER_PIPE: fold push_pipe() into __pipe_get_pages()  (Al Viro; 1 file, -55/+25)
Expand the only remaining call of push_pipe() (in __pipe_get_pages()), combine it with the page-collecting loop there. Note that the only reason it's not a loop doing append_pipe() is that append_pipe() is advancing, while iov_iter_get_pages() is not. As soon as it switches to saner semantics, this thing will switch to using append_pipe(). Signed-off-by: Al Viro <[email protected]>
2022-08-08  ITER_PIPE: allocate buffers as we go in copy-to-pipe primitives  (Al Viro; 1 file, -73/+98)
New helper: append_pipe(). Extends the last buffer if possible, allocates a new one otherwise. Returns page and offset in it on success, NULL on failure. iov_iter is advanced past the data we've got. Use that instead of push_pipe() in copy-to-pipe primitives; they get simpler that way. Handling of short copy (in the "mc" one) is done simply by iov_iter_revert() - iov_iter is in a consistent state after that one, so we can use that. [Fix for braino caught by Liu Xinpeng <[email protected]> folded in] [another braino fix, this time in copy_pipe_to_iter() and pipe_zero(); caught by testcase from Hugh Dickins] Signed-off-by: Al Viro <[email protected]>
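A condensed sketch of append_pipe() along the lines described above (hedged; details illustrative):

	static struct page *append_pipe(struct iov_iter *i, size_t size,
					unsigned int *off)
	{
		struct pipe_inode_info *pipe = i->pipe;
		unsigned int offset = i->last_offset;
		struct page *page;

		if (offset > 0 && offset < PAGE_SIZE) {
			/* some space left in the last anon buffer - extend it */
			struct pipe_buffer *buf = pipe_buf(pipe, pipe->head - 1);

			size = min_t(size_t, size, PAGE_SIZE - offset);
			buf->len += size;
			i->last_offset += size;
			i->count -= size;
			*off = offset;
			return buf->page;
		}
		/* otherwise start a new anon buffer */
		*off = 0;
		size = min_t(size_t, size, PAGE_SIZE);
		if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
			return NULL;
		page = push_anon(pipe, size);
		if (!page)
			return NULL;
		i->head = pipe->head - 1;
		i->last_offset = size;
		i->count -= size;
		return page;
	}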
2022-08-08  ITER_PIPE: helpers for adding pipe buffers  (Al Viro; 1 file, -42/+46)
There are only two kinds of pipe_buffer in the area used by ITER_PIPE:
 1) anonymous - copy_to_iter() et.al. end up creating those and copying data there. They have zero ->offset, and their ->ops points to default_pipe_buf_ops.
 2) zero-copy ones - those come from copy_page_to_iter(), and the page comes from the caller. ->offset is also caller-supplied - it might be non-zero. ->ops points to page_cache_pipe_buf_ops.
Move creation and insertion of those into helpers - push_anon(pipe, size) and push_page(pipe, page, offset, size) resp., separating them from the "could we avoid creating a new buffer by merging with the current head?" logic. Acked-by: Jeff Layton <[email protected]> Signed-off-by: Al Viro <[email protected]>
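Roughly, the two helpers look like this (hedged sketch matching the description):

	static struct page *push_anon(struct pipe_inode_info *pipe, unsigned size)
	{
		struct page *page = alloc_page(GFP_USER);

		if (page) {
			struct pipe_buffer *buf = pipe_buf(pipe, pipe->head++);

			*buf = (struct pipe_buffer) {
				.ops = &default_pipe_buf_ops,
				.page = page,
				.offset = 0,
				.len = size
			};
		}
		return page;
	}

	static void push_page(struct pipe_inode_info *pipe, struct page *page,
			      unsigned int offset, unsigned int size)
	{
		struct pipe_buffer *buf = pipe_buf(pipe, pipe->head++);

		*buf = (struct pipe_buffer) {
			.ops = &page_cache_pipe_buf_ops,
			.page = page,
			.offset = offset,
			.len = size
		};
		get_page(page);	/* zero-copy: pin the caller-supplied page */
	}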
2022-08-08  ITER_PIPE: helper for getting pipe buffer by index  (Al Viro; 1 file, -6/+9)
pipe_buffer instances of a pipe are organized as a ring buffer, with a power-of-2 size. Indices are kept *not* reduced modulo ring size, so the buffer referred to by index N is pipe->bufs[N & (pipe->ring_size - 1)]. Ring size can change over the lifetime of a pipe, but not while the pipe is locked, so for any iov_iter primitives it's a constant. The original conversion of pipes to this layout went overboard trying to micro-optimize that - calculating pipe->ring_size - 1, storing it in a local variable and using it through the function. In some cases it might be warranted, but most of the time it only obfuscates what's going on in there. Introduce a helper (pipe_buf(pipe, N)) that encapsulates that and use it in the obvious cases. More will follow... Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Christian Brauner (Microsoft) <[email protected]> Signed-off-by: Al Viro <[email protected]>
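The helper itself is tiny (sketch per the description above):

	static inline struct pipe_buffer *pipe_buf(const struct pipe_inode_info *pipe,
						   unsigned int slot)
	{
		/* indices are not reduced modulo ring size - mask here */
		return &pipe->bufs[slot & (pipe->ring_size - 1)];
	}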
2022-08-08  splice: stop abusing iov_iter_advance() to flush a pipe  (Al Viro; 1 file, -5/+2)
Use pipe_discard_from() explicitly in generic_file_read_iter(); don't bother with rather non-obvious use of iov_iter_advance() in there. Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Christian Brauner (Microsoft) <[email protected]> Signed-off-by: Al Viro <[email protected]>
2022-08-08  switch new_sync_{read,write}() to ITER_UBUF  (Al Viro; 1 file, -4/+2)
Reviewed-by: Christian Brauner (Microsoft) <[email protected]> Signed-off-by: Al Viro <[email protected]>
2022-08-08  new iov_iter flavour - ITER_UBUF  (Al Viro; 12 files, -31/+108)
Equivalent of a single-segment iovec. Initialized by iov_iter_ubuf(), checked for by iter_is_ubuf(); otherwise behaves like ITER_IOVEC ones. We are going to expose things like ->write_iter() et.al. to those in subsequent commits. A new predicate (user_backed_iter()) is true for ITER_IOVEC and ITER_UBUF; places like direct-IO handling should use it when checking whether pages we modify after getting them from iov_iter_get_pages() need to be dirtied. DO NOT assume that replacing iter_is_iovec() with user_backed_iter() will solve all problems - there's code that uses iter_is_iovec() to decide how to poke around in iov_iter guts, and for that the predicate replacement obviously won't suffice. Signed-off-by: Al Viro <[email protected]>
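A sketch of the initializer and the new predicate (hedged; exact field layout illustrative):

	static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
					 void __user *buf, size_t count)
	{
		WARN_ON(direction & ~(READ | WRITE));
		*i = (struct iov_iter) {
			.iter_type = ITER_UBUF,
			.user_backed = true,
			.data_source = direction,
			.ubuf = buf,
			.count = count
		};
	}

	static inline bool user_backed_iter(const struct iov_iter *i)
	{
		return i->user_backed;	/* true for ITER_UBUF and ITER_IOVEC */
	}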
2022-08-08  Documentation/mm: add details about kmap_local_page() and preemption  (Fabio M. De Francesco; 1 file, -4/+9)
What happens if a thread is preempted after mapping pages with kmap_local_page() was questioned recently.[1] Commit f3ba3c710ac5 ("mm/highmem: Provide kmap_local*") from Thomas Gleixner explains clearly that on context switch the maps of an outgoing task are removed and the maps of the incoming task are restored, and that kmap_local_page() can be invoked from both preemptible and atomic contexts.[2] Therefore, to make it clearer that users can call kmap_local_page() from contexts that allow preemption, rework a couple of sentences and add further information in highmem.rst. [1] https://lore.kernel.org/lkml/5303077.Sb9uPGUboI@opensuse/ [2] https://lore.kernel.org/all/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Fabio M. De Francesco <[email protected]> Suggested-by: Ira Weiny <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Peter Collingbourne <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-08-08  highmem: delete a sentence from kmap_local_page() kdocs  (Fabio M. De Francesco; 1 file, -2/+1)
kmap_local_page() should always be preferred in place of kmap() and kmap_atomic(). "Only use when really necessary." is not consistent with the Documentation/mm/highmem.rst and these kdocs it embeds. Therefore, delete the above-mentioned sentence from kdocs. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Fabio M. De Francesco <[email protected]> Suggested-by: Ira Weiny <[email protected]> Reviewed-by: Ira Weiny <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Peter Collingbourne <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-08-08  Documentation/mm: prefer kmap_local_page() and avoid kmap()  (Fabio M. De Francesco; 1 file, -0/+5)
The reasoning for converting kmap() to kmap_local_page() was questioned recently.[1] There are two main problems with kmap(): (1) It comes with an overhead as mapping space is restricted and protected by a global lock for synchronization and (2) kmap() also requires global TLB invalidation when its pool wraps and it might block when the mapping space is fully utilized until a slot becomes available. Warn users to avoid the use of kmap() and instead use kmap_local_page(), by designing their code to map pages in the same context the mapping will be used. [1] https://lore.kernel.org/lkml/1891319.taCxCBeP46@opensuse/ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Fabio M. De Francesco <[email protected]> Suggested-by: Ira Weiny <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Peter Collingbourne <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-08-08  Documentation/mm: avoid invalid use of addresses from kmap_local_page()  (Fabio M. De Francesco; 1 file, -0/+7)
Users of kmap_local_page() must be absolutely sure not to hand kernel virtual addresses, obtained by calling kmap_local_page() on highmem pages, to other contexts, because those pointers are thread-local and therefore no longer valid across different contexts. Extend the documentation of kmap_local_page() to warn users about this potential invalid use of the pointers it returns. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Fabio M. De Francesco <[email protected]> Suggested-by: Ira Weiny <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Peter Collingbourne <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
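For example, a correct use keeps the mapping strictly local to one context (illustrative fragment):

	static void copy_from_page(struct page *page, void *dst, size_t len)
	{
		void *src = kmap_local_page(page);

		memcpy(dst, src, len);
		kunmap_local(src);
		/* do NOT stash 'src' anywhere - it is thread-local and
		 * becomes invalid outside this context */
	}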
2022-08-08  Documentation/mm: don't kmap*() pages which can't come from HIGHMEM  (Fabio M. De Francesco; 1 file, -0/+6)
There is no need to kmap*() pages which are guaranteed to come from ZONE_NORMAL (or lower). Linux currently has several call sites of kmap{,_atomic,_local_page}() on pages which are clearly known to be unable to come from ZONE_HIGHMEM. Therefore, add a paragraph to highmem.rst to explain better that a plain page_address() may be used for getting the address of pages which cannot come from ZONE_HIGHMEM, although it is always safe to use kmap_local_page() / kunmap_local() on those pages too. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Fabio M. De Francesco <[email protected]> Suggested-by: Ira Weiny <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Peter Collingbourne <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
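Illustrative fragment of the distinction (hedged):

	struct page *page = alloc_page(GFP_KERNEL);	/* no __GFP_HIGHMEM */
	void *addr = page_address(page);	/* fine: cannot be in ZONE_HIGHMEM */

	/* always-safe alternative, works for highmem pages too: */
	void *addr2 = kmap_local_page(page);
	/* ... use addr2 ... */
	kunmap_local(addr2);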
2022-08-08  highmem: specify that kmap_local_page() is callable from interrupts  (Fabio M. De Francesco; 1 file, -1/+1)
In a recent thread about converting kmap() to kmap_local_page(), the safety of calling kmap_local_page() was questioned.[1] "any context" should probably be enough detail for users who want to know whether or not kmap_local_page() can be called from interrupts. However, Linux still has kmap_atomic() which might make users think they must use the latter in interrupts. Add "including interrupts" for better clarity. [1] https://lore.kernel.org/lkml/3187836.aeNJFYEL58@opensuse/ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Fabio M. De Francesco <[email protected]> Suggested-by: Ira Weiny <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Peter Collingbourne <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-08-08  highmem: remove unneeded spaces in kmap_local_page() kdocs  (Fabio M. De Francesco; 1 file, -1/+1)
Patch series "highmem: Extend kmap_local_page() documentation", v2. The Highmem interface is evolving and the current documentation does not reflect the intended uses of each of the calls. Furthermore, after a recent series of reworks, the differences of the calls can still be confusing and may lead to the expanded use of calls which are deprecated. This series is the second round of changes towards an enhanced documentation of the Highmem's interface; at this stage the patches are only focused to kmap_local_page(). In addition it also contains some minor clean ups. This patch (of 7): In the kdocs of kmap_local_page(), the description of @page starts after several unnecessary spaces. Therefore, remove those spaces. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Fabio M. De Francesco <[email protected]> Suggested-by: Ira Weiny <[email protected]> Reviewed-by: Ira Weiny <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: Peter Collingbourne <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-08-08  mm, hwpoison: enable memory error handling on 1GB hugepage  (Naoya Horiguchi; 3 files, -18/+0)
Now that the error handling code is prepared, remove the blocking code and enable memory error handling on 1GB hugepages. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Naoya Horiguchi <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: kernel test robot <[email protected]> Cc: Liu Shixin <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Muchun Song <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-08-08  mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage  (Naoya Horiguchi; 1 file, -4/+6)
Currently if memory_failure() (modified to remove the blocking code in a subsequent patch) is called on a page in some 1GB hugepage, memory error handling fails and the raw error page gets into a leaked state. The impact is small in production systems (just a leaked single 4kB page), but this limits testability, because unpoison doesn't work for it. We can no longer create a 1GB hugepage on the 1GB physical address range with such leaked pages, which is not useful when testing on small systems.

When a hwpoison page in a 1GB hugepage is handled, it's caught by the PageHWPoison check in free_pages_prepare(), because the 1GB hugepage is broken down into raw error pages before coming to this point:

	if (unlikely(PageHWPoison(page)) && !order) {
		...
		return false;
	}

Then the page is not sent to buddy and the page refcount is left at 0. Originally this check was supposed to fire when the error page is freed from page_handle_poison() (which is called from soft-offline), but now we are opening another path to it, so the callers of __page_handle_poison() need to handle the case by treating the return value 0 as success. Then the page refcount for hwpoison is properly incremented, so unpoison works. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Naoya Horiguchi <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: kernel test robot <[email protected]> Cc: Liu Shixin <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Muchun Song <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-08-08  mm, hwpoison: make __page_handle_poison() return int  (Naoya Horiguchi; 1 file, -5/+11)
__page_handle_poison() currently returns a bool that shows whether take_page_off_buddy() has passed or not. But we will want to distinguish another case, "dissolve has passed but taking off failed", by its return value, so change the type of the return value. No functional change. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Naoya Horiguchi <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: kernel test robot <[email protected]> Cc: Liu Shixin <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Muchun Song <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-08-08  mm, hwpoison: set PG_hwpoison for busy hugetlb pages  (Naoya Horiguchi; 2 files, -4/+5)
If memory_failure() fails to grab the page refcount on a hugetlb page because it's busy, it returns without setting PG_hwpoison on it. This not only loses a chance of error containment, but breaks the rule that action_result() should be called only when memory_failure() does any handling work (even if that's just setting PG_hwpoison). This inconsistency could harm code maintainability. So set PG_hwpoison and call hugetlb_set_page_hwpoison() for such a case. Link: https://lkml.kernel.org/r/[email protected] Fixes: 405ce051236c ("mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()") Signed-off-by: Naoya Horiguchi <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: kernel test robot <[email protected]> Cc: Liu Shixin <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Muchun Song <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-08-08  mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage  (Naoya Horiguchi; 2 files, -5/+56)
The raw error info list needs to be removed when a hwpoisoned hugetlb page is unpoisoned, and the unpoison handler needs to know how many errors there are in the target hugepage. So add both. Hugepages for which HPageVmemmapOptimized(hpage) or HPageRawHwpUnreliable(hpage) is true sometimes can't be unpoisoned, so skip them. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Naoya Horiguchi <[email protected]> Reported-by: kernel test robot <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Liu Shixin <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Muchun Song <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-08-08  mm, hwpoison, hugetlb: support saving mechanism of raw error pages  (Naoya Horiguchi; 3 files, -13/+116)
When handling a memory error on a hugetlb page, the error handler tries to dissolve it and turn it into 4kB pages. If it's successfully dissolved, the PageHWPoison flag is moved to the raw error page, so that's all right. However, dissolve sometimes fails; the error page is then left as a hwpoisoned hugepage. It would be useful if we could retry dissolving it to save healthy pages, but that's not possible now because the information about where the raw error pages are is lost.

Use the private field of a few tail pages to keep that information. The code path for shrinking the hugepage pool uses this info to retry the delayed dissolve. In order to remember multiple errors in a hugepage, a singly-linked list originating from the SUBPAGE_INDEX_HWPOISON-th tail page is constructed. Only simple operations (adding an entry or clearing all) are required and the list is assumed not to be very long, so this simple data structure should be enough.

If we fail to save the raw error info, the hwpoison hugepage has errors on an unknown subpage and this new saving mechanism does not work any more, so disable saving new raw error info and freeing hwpoison hugepages. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Naoya Horiguchi <[email protected]> Reported-by: kernel test robot <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Liu Shixin <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Muchun Song <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-08-08  mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present pud entry  (Naoya Horiguchi; 2 files, -3/+37)
follow_pud_mask() does not support non-present pud entries now. As far as I tested on an x86_64 server, follow_pud_mask() still simply returns no_page_table() for a non-present pud entry due to pud_bad(), so no severe user-visible effect should happen. But generally we should call follow_huge_pud() for a non-present pud entry of a 1GB hugetlb page. Update pud_huge() and follow_huge_pud() to handle non-present pud entries. The changes are similar to the previous works for pmd entries, commit e66f17ff7177 ("mm/hugetlb: take page table lock in follow_huge_pmd()") and commit cbef8478bee5 ("mm/hugetlb: pmd_huge() returns true for non-present hugepage"). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Naoya Horiguchi <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: kernel test robot <[email protected]> Cc: Liu Shixin <[email protected]> Cc: Muchun Song <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-08-08  mm/hugetlb: check gigantic_page_runtime_supported() in return_unused_surplus_pages()  (Naoya Horiguchi; 1 file, -2/+1)
Patch series "mm, hwpoison: enable 1GB hugepage support", v7. This patch (of 8): I found a weird state of the 1GB hugepage pool, caused by the following procedure:
 - run a process reserving all free 1GB hugepages,
 - shrink the free 1GB hugepage pool to zero (i.e. write 0 to /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages), then
 - kill the reserving process.
After that, all the hugepages are free *and* surplus at the same time:

	$ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
	3
	$ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/free_hugepages
	3
	$ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/resv_hugepages
	0
	$ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/surplus_hugepages
	3

This state is resolved by reserving and allocating the pages and then freeing them again, so it does not seem to result in a serious problem. But it's a little surprising (shrinking the pool suddenly fails). This behavior is caused by the hstate_is_gigantic() check in return_unused_surplus_pages(). It was introduced long ago, in 2008, by commit aa888a74977a ("hugetlb: support larger than MAX_ORDER"), and at that time gigantic pages were not supposed to be allocated/freed at run-time. Now the kernel supports runtime allocation/freeing, so let's check gigantic_page_runtime_supported() together. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Naoya Horiguchi <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Liu Shixin <[email protected]> Cc: Yang Shi <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Muchun Song <[email protected]> Cc: kernel test robot <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-08-08  mm: hugetlb_vmemmap: use PTRS_PER_PTE instead of PMD_SIZE / PAGE_SIZE  (Muchun Song; 1 file, -1/+1)
There is already a macro PTRS_PER_PTE to represent the number of page table entries, just use it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Will Deacon <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
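The shape of the change (variable name invented for illustration):

	/* before: open-coded count of PTEs covered by one PMD entry */
	unsigned long nr = PMD_SIZE / PAGE_SIZE;

	/* after: the macro that already expresses exactly that */
	unsigned long nr = PTRS_PER_PTE;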
2022-08-08  mm: hugetlb_vmemmap: move code comments to vmemmap_dedup.rst  (Muchun Song; 2 files, -36/+49)
All the comments which explain how HVO works were moved to vmemmap_dedup.rst by commit 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for compound devmaps"), except some comments above page_fixed_fake_head(). This commit moves those comments to vmemmap_dedup.rst as well, and improves vmemmap_dedup.rst in the process. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Will Deacon <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-08-08  mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability  (Muchun Song; 5 files, -108/+102)
There is a discussion about the names of hugetlb_vmemmap_alloc/free in thread [1]. David suggested renaming "alloc/free" to "optimize/restore" to make the functionality clearer to users: "optimize" means the function will optimize the vmemmap pages, while "restore" means restoring the vmemmap pages discarded before. This commit does that.

Another discussion is about the confusion that RESERVE_VMEMMAP_NR isn't used explicitly for vmemmap_addr but implicitly for vmemmap_end in hugetlb_vmemmap_alloc/free. David suggested that we can compute at runtime what hugetlb_vmemmap_init() does now. We do not need to worry about the overhead of computing at runtime, since the calculation is simple enough and those functions are not in a hot path.

This commit has the following improvements:
 1) The function names (suffixed "optimize/restore") are more expressive.
 2) The logic becomes less weird in hugetlb_vmemmap_optimize/restore().
 3) hugetlb_vmemmap_init() does not need to be exported anymore.
 4) The ->optimize_vmemmap_pages field in struct hstate is killed.
 5) There is only one place that checks is_power_of_2(sizeof(struct page)) instead of two.
 6) More comments are added for hugetlb_vmemmap_optimize/restore().
 7) For external users, hugetlb_optimize_vmemmap_pages() was originally used for detecting whether a HugeTLB page's vmemmap pages are optimizable. In this commit it is killed, and we introduce a new helper, hugetlb_vmemmap_optimizable(), to replace it; the name is more expressive.
Link: https://lore.kernel.org/all/[email protected]/ [1] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Will Deacon <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-08-08  mm: hugetlb_vmemmap: replace early_param() with core_param()  (Muchun Song; 1 file, -8/+2)
After commit 78f39084b41d ("mm: hugetlb_vmemmap: add hugetlb_optimize_vmemmap sysctl"), there is no ordering requirement between the "hugetlb_free_vmemmap" and "hugepages" parameters, since we have removed the check of whether HVO is enabled from hugetlb_vmemmap_init(). Therefore we can safely replace early_param() with core_param() to simplify the code. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Will Deacon <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
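The before/after shape of the change (the handler and variable names below are assumptions, not the verbatim diff):

	/* before: parsed very early, hence ordering-sensitive */
	early_param("hugetlb_free_vmemmap", hugetlb_vmemmap_early_param);

	/* after: an ordinary boolean kernel parameter, no ordering constraint */
	core_param(hugetlb_free_vmemmap, vmemmap_optimize_enabled, bool, 0);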
2022-08-08  mm: hugetlb_vmemmap: move vmemmap code related to HugeTLB to hugetlb_vmemmap.c  (Muchun Song; 3 files, -407/+398)
When I first introduced the vmemmap manipulation functions related to HugeTLB, I thought those functions might be reused by other modules (e.g. using a similar approach to optimize vmemmap pages; unfortunately, DAX used the same approach but does not use those functions). After two years, we haven't seen any other users, so move those functions to hugetlb_vmemmap.c. Code movement without any functional change. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Will Deacon <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-08-08  mm: hugetlb_vmemmap: introduce the name HVO  (Muchun Song; 9 files, -24/+23)
It is inconvenient to mention the feature of optimizing vmemmap pages associated with HugeTLB pages when communicating with others, since it never got a specific or abbreviated name when it was first introduced. Let us give it the name HVO (HugeTLB Vmemmap Optimization) from now on. This commit also updates the documentation of "hugetlb_free_vmemmap" in the way discussed in thread [1]. Link: https://lore.kernel.org/all/[email protected]/ [1] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Will Deacon <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-08-08  mm: hugetlb_vmemmap: optimize vmemmap_optimize_mode handling  (Muchun Song; 2 files, -62/+9)
We hold another reference to hugetlb_optimize_vmemmap_key when turning vmemmap_optimize_mode on, because we use the static_key to tell memory_hotplug that memory_hotplug.memmap_on_memory should be overridden. However, that rule is gone now that we have introduced PageVmemmapSelfHosted. Therefore, we can simplify vmemmap_optimize_mode handling by not holding the extra reference to hugetlb_optimize_vmemmap_key. This also means that we do not incur the extra page_fixed_fake_head() checks if there are no vmemmap-optimized hugetlb pages after this change. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Will Deacon <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-08-08  mm: hugetlb_vmemmap: delete hugetlb_optimize_vmemmap_enabled()  (Muchun Song; 2 files, -22/+5)
Patch series "Simplify hugetlb vmemmap and improve its readability", v2. This series aims to simplify hugetlb vmemmap and improve its readability. This patch (of 8): The name hugetlb_optimize_vmemmap_enabled() a bit confusing as it tests two conditions (enabled and pages in use). Instead of coming up to an appropriate name, we could just delete it. There is already a discussion about deleting it in thread [1]. There is only one user of hugetlb_optimize_vmemmap_enabled() outside of hugetlb_vmemmap, that is flush_dcache_page() in arch/arm64/mm/flush.c. However, it does not need to call hugetlb_optimize_vmemmap_enabled() in flush_dcache_page() since HugeTLB pages are always fully mapped and only head page will be set PG_dcache_clean meaning only head page's flag may need to be cleared (see commit cf5a501d985b). So it is easy to remove hugetlb_optimize_vmemmap_enabled(). Link: https://lore.kernel.org/all/[email protected]/ [1] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Reviewed-by: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>