|
Replace the memory controller's custom hierarchical stats code with the
generic rstat infrastructure provided by the cgroup core.
The current implementation does batched upward propagation from the
write side (i.e. as stats change). The per-cpu batches introduce an
error, which is multiplied by the number of subgroups in a tree. In
systems with many CPUs and sizable cgroup trees, the error can be large
enough to confuse users (e.g. 32 batch pages * 32 CPUs * 32 subgroups
results in an error of up to 128M per stat item). This can entirely
swallow allocation bursts inside a workload that the user is expecting
to see reflected in the statistics.
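To make the quoted bound concrete, the arithmetic (assuming 4K pages) is:
32 pages/batch * 4096 bytes * 32 CPUs * 32 subgroups = 134,217,728 bytes,
i.e. up to 128M of drift per stat item.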
In the past, we've done read-side aggregation, where a memory.stat read
would have to walk the entire subtree and add up per-cpu counts. This
became problematic with lazily-freed cgroups: we could have large
subtrees where most cgroups were entirely idle. Hence the switch to
change-driven upward propagation. Unfortunately, it needed to trade
accuracy for speed due to the write side being so hot.
Rstat combines the best of both worlds: from the write side, it cheaply
maintains a queue of cgroups that have pending changes, so that the read
side can do selective tree aggregation. This way the reported stats
will always be precise and as recent as can be, while the aggregation can
skip over potentially large numbers of idle cgroups.
The way rstat works is that it implements a tree for tracking cgroups
with pending local changes, as well as a flush function that walks the
tree upwards. The controller then drives this by 1) telling rstat when
a local cgroup stat changes (e.g. mod_memcg_state) and 2) when a flush
is required to get uptodate hierarchy stats for a given subtree (e.g.
when memory.stat is read). The controller also provides a flush
callback that is called during the rstat flush walk for each cgroup and
aggregates its local per-cpu counters and propagates them upwards.
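As a rough sketch of that division of labor - the foo_* names and per-cpu
fields below are placeholders for illustration, while cgroup_rstat_updated(),
cgroup_rstat_flush() and the css_rstat_flush callback are the actual rstat
hooks:

        /* Write side: record the local change, then mark this cgroup as
         * having pending deltas on this CPU. */
        static void foo_stat_add(struct cgroup_subsys_state *css, int idx, int val)
        {
                this_cpu_add(css_to_foo(css)->pcpu->stat[idx], val);
                cgroup_rstat_updated(css->cgroup, smp_processor_id());
        }

        /* Read side: flush the subtree before reporting. */
        static int foo_stat_show(struct seq_file *sf, void *v)
        {
                cgroup_rstat_flush(seq_css(sf)->cgroup);
                /* ... print the now up-to-date aggregated counters ... */
                return 0;
        }

        /* Called by the rstat flush walk for each cgroup with pending
         * changes on @cpu: fold the per-cpu deltas into the cgroup's
         * totals and hand them to the parent. */
        static void foo_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
        {
                /* ... aggregate this CPU's deltas, propagate upward ... */
        }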
This adds a second vmstats to struct mem_cgroup (MEMCG_NR_STAT +
NR_VM_EVENT_ITEMS) to track pending subtree deltas during upward
aggregation. It removes 3 words from the per-cpu data. It eliminates
memcg_exact_page_state(), since memcg_page_state() is now exact.
[[email protected]: merge fix]
[[email protected]: fix a sleep in atomic section problem]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Reviewed-by: Michal Koutný <[email protected]>
Acked-by: Balbir Singh <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Current users of the rstat code can source root-level statistics from
the native counters of their respective subsystem, allowing them to
forego aggregation at the root level. This optimization is currently
implemented inside the generic rstat code, which doesn't track the root
cgroup and doesn't invoke the subsystem flush callbacks on it.
However, the memory controller cannot do this optimization, because
cgroup1 breaks out memory specifically for the local level, including at
the root level. In preparation for the memory controller switching to
rstat, move the optimization from rstat core to the controllers.
Afterwards, rstat will always track the root cgroup for changes and
invoke the subsystem callbacks on it; and it's up to the subsystem to
special-case and skip aggregation of the root cgroup if it can source
this information through other, cheaper means.
This is the case for the io controller and the cgroup base stats. In
their respective flush callbacks, check whether the parent is the root
cgroup, and if so, skip the unnecessary upward propagation.
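As an illustrative sketch of what such a flush callback looks like after the
move (the foo_* parts are placeholders; cgroup_parent() and the
css_rstat_flush callback are the real interfaces):

        static void foo_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
        {
                struct cgroup *parent = cgroup_parent(css->cgroup);

                /* Root-level stats are sourced from the subsystem's own
                 * global counters; nothing to aggregate for the root itself. */
                if (!parent)
                        return;

                /* ... fold this CPU's pending deltas into css's totals ... */

                /* If the parent is the root cgroup, upward propagation is
                 * unnecessary for the same reason. */
                if (cgroup_parent(parent))
                        /* ... add the deltas to the parent's pending totals ... */;
        }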
The extra cost of tracking the root cgroup is negligible: on stat
changes, we actually remove a branch that checks for the root. The
queueing for a flush touches only per-cpu data, and only the first stat
change since a flush requires a (per-cpu) lock.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Rstat currently only supports the default hierarchy in cgroup2. In
order to replace memcg's private stats infrastructure - used in both
cgroup1 and cgroup2 - with rstat, the latter needs to support cgroup1.
The initialization and destruction callbacks for regular cgroups are
already in place. Remove the cgroup_on_dfl() guards to handle cgroup1.
The initialization of the root cgroup is currently hardcoded to only
handle cgrp_dfl_root.cgrp. Move those callbacks to cgroup_setup_root()
and cgroup_destroy_root() to handle the default root as well as the
various cgroup1 roots we may set up during mounting.
The linking of css to cgroups happens in code shared between cgroup1 and
cgroup2 as well. Simply remove the cgroup_on_dfl() guard.
Linkage of the root css to the root cgroup is a bit trickier: per
default, the root css of a subsystem controller belongs to the default
hierarchy (i.e. the cgroup2 root). When a controller is mounted in its
cgroup1 version, the root css is stolen and moved to the cgroup1 root;
on unmount, the css moves back to the default hierarchy. Annotate
rebind_subsystems() to move the root css linkage along between roots.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Reviewed-by: Michal Koutný <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There are no users outside of the memory controller itself. The rest
of the kernel cares either about node or lruvec stats.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Michal Koutný <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
No need to encapsulate a simple struct member access.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Michal Koutný <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm: memcontrol: switch to rstat", v3.
This series converts memcg stats tracking to the streamlined rstat
infrastructure provided by the cgroup core code. rstat is already used by
the CPU controller and the IO controller. This change is motivated by
recent accuracy problems in memcg's custom stats code, as well as the
benefits of sharing common infra with other controllers.
The current memcg implementation does batched tree aggregation on the
write side: local stat changes are cached in per-cpu counters, which are
then propagated upward in batches when a threshold (32 pages) is exceeded.
This is cheap, but the error introduced by the lazy upward propagation
adds up: 32 pages times CPUs times cgroups in the subtree. We've had
complaints from service owners that the stats do not reliably track and
react to allocation behavior as expected, sometimes swallowing the results
of entire test applications.
The original memcg stat implementation used to do tree aggregation
exclusively on the read side: local stats would only ever be tracked in
per-cpu counters, and a memory.stat read would iterate the entire subtree
and sum those counters up. This didn't keep up with the times:
- Cgroup trees are much bigger now. We switched to lazily-freed
cgroups, where deleted groups would hang around until their remaining
page cache has been reclaimed. This can result in large subtrees that
are expensive to walk, while most of the groups are idle and their
statistics don't change much anymore.
- Automated monitoring increased. With the proliferation of userspace
oom killing, proactive reclaim, and higher-resolution logging of
workload trends in general, top-level stat files are polled at least
once a second in many deployments.
- The lifetime of cgroups got shorter. Where most cgroup setups in the
past would have a few large policy-oriented cgroups for everything
running on the system, newer cgroup deployments tend to create one
group per application - which gets deleted again as the processes
exit. An aggregation scheme that doesn't retain child data inside the
parents loses event history of the subtree.
Rstat addresses all three of those concerns through intelligent,
persistent read-side aggregation. As statistics change at the local
level, rstat tracks - on a per-cpu basis - only those parts of a subtree
that have changes pending and require aggregation. The actual
aggregation occurs on the colder read side - which can now skip over
(potentially large) numbers of recently idle cgroups.
===
The test_kmem cgroup selftest is currently failing due to excessive
cumulative vmstat drift from 100 subgroups:
ok 1 test_kmem_basic
memory.current = 8810496
slab + anon + file + kernel_stack = 17074568
slab = 6101384
anon = 946176
file = 0
kernel_stack = 10027008
not ok 2 test_kmem_memcg_deletion
ok 3 test_kmem_proc_kpagecgroup
ok 4 test_kmem_kernel_stacks
ok 5 test_kmem_dead_cgroups
ok 6 test_percpu_basic
As you can see, memory.stat items far exceed memory.current. The kernel
stack alone is bigger than all of charged memory. That's because the
memory of the test has been uncharged from memory.current, but the
negative vmstat deltas are still sitting in the percpu caches.
The test at this time isn't even counting percpu, pagetables etc. yet,
which would further contribute to the error. The last patch in the series
updates the test to include them - as well as reduces the vmstat
tolerances in general to only expect page_counter batching.
With all patches applied, the (now more stringent) test succeeds:
ok 1 test_kmem_basic
ok 2 test_kmem_memcg_deletion
ok 3 test_kmem_proc_kpagecgroup
ok 4 test_kmem_kernel_stacks
ok 5 test_kmem_dead_cgroups
ok 6 test_percpu_basic
===
A kernel build test confirms that overhead is comparable. Two kernels are
built simultaneously in a nested tree with several idle siblings:
root - kernelbuild - one - two - three - four - build-a (defconfig, make -j16)
`- build-b (defconfig, make -j16)
`- idle-1
`- ...
`- idle-9
During the builds, kernelbuild/memory.stat is read once a second.
A perf diff shows that the change in cycle distribution is minimal. Top
10 kernel symbols:
0.09% +0.08% [kernel.kallsyms] [k] __mod_memcg_lruvec_state
0.00% +0.06% [kernel.kallsyms] [k] cgroup_rstat_updated
0.08% -0.05% [kernel.kallsyms] [k] __mod_memcg_state.part.0
0.16% -0.04% [kernel.kallsyms] [k] release_pages
0.00% +0.03% [kernel.kallsyms] [k] __count_memcg_events
0.01% +0.03% [kernel.kallsyms] [k] mem_cgroup_charge_statistics.constprop.0
0.10% -0.02% [kernel.kallsyms] [k] get_mem_cgroup_from_mm
0.05% -0.02% [kernel.kallsyms] [k] mem_cgroup_update_lru_size
0.57% +0.01% [kernel.kallsyms] [k] asm_exc_page_fault
===
The on-demand aggregated stats are now fully accurate:
$ grep -e nr_inactive_file /proc/vmstat | awk '{print($1,$2*4096)}'; \
grep -e inactive_file /sys/fs/cgroup/memory.stat
vanilla: patched:
nr_inactive_file 1574105088 nr_inactive_file 1027801088
inactive_file 1577410560 inactive_file 1027801088
===
This patch (of 8):
The memcg hotunplug callback erroneously flushes counts on the local CPU,
not the counts of the CPU going away; those counts will be lost.
Flush the CPU that is actually going away.
Also simplify the code a bit by using mod_memcg_state() and
count_memcg_events() instead of open-coding the upward flush - this is
comparable to how vmstat.c handles hotunplug flushing.
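A minimal sketch of the intended shape - the per-cpu struct and field names
here are assumptions for illustration, while for_each_mem_cgroup(),
mod_memcg_state() and count_memcg_events() are the real helpers:

        static int memcg_hotplug_cpu_dead(unsigned int cpu)
        {
                struct mem_cgroup *memcg;
                int i;

                for_each_mem_cgroup(memcg) {
                        /* Assumed per-cpu counter layout. */
                        struct memcg_vmstats_percpu *statc =
                                per_cpu_ptr(memcg->vmstats_percpu, cpu);

                        /* Fold the dead CPU's deltas back in through the
                         * regular accessors instead of open-coding the flush. */
                        for (i = 0; i < MEMCG_NR_STAT; i++)
                                if (statc->stat[i])
                                        mod_memcg_state(memcg, i, statc->stat[i]);

                        for (i = 0; i < NR_VM_EVENT_ITEMS; i++)
                                if (statc->events[i])
                                        count_memcg_events(memcg, i, statc->events[i]);
                }
                return 0;
        }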
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: a983b5ebee572 ("mm: memcontrol: fix excessive complexity in memory.stat reporting")
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Reviewed-by: Michal Koutný <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In the era of the async memcg oom-killer, commit a0d8b00a3381 ("mm: memcg:
do not declare OOM from __GFP_NOFAIL allocations") added code to skip the
memcg oom-killer for __GFP_NOFAIL allocations. The reason was that
__GFP_NOFAIL callers will not enter the async oom synchronization path and
will keep the task marked as in memcg oom. At that time, tasks marked as
in memcg oom could bypass the memcg limits, and the oom synchronization
would only happen later, during a subsequent userspace-triggered page
fault - letting a task marked as under memcg oom bypass the memcg limit
for an arbitrary amount of time.
With the synchronous memcg oom-killer (commit 29ef680ae7c21 ("memcg, oom:
move out_of_memory back to the charge path")) and with tasks marked as
under memcg oom no longer allowed to bypass the memcg limits (commit
1f14c1ac19aa4 ("mm: memcg: do not allow task about to OOM kill to bypass
the limit")), we can again allow __GFP_NOFAIL allocations to trigger the
memcg oom-kill.
This will make memcg oom behavior closer to page allocator oom behavior.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Replace the implicit root memcg check, i.e. !css->parent, with the
explicit mem_cgroup_is_root().
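Schematically - before/after, with mem_cgroup_from_css() used to get at the
memcg; the exact call sites vary:

        /* before: the root check is implicit */
        if (!css->parent)
                return;

        /* after: the intent is spelled out */
        if (mem_cgroup_is_root(mem_cgroup_from_css(css)))
                return;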
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
For simplification commit 991e7673859e ("mm: memcontrol: account kernel
stack per node") changed the per zone vmalloc backed stack pages
accounting to per node.
By doing that we have lost a certain amount of precision because those
pages might live on different NUMA nodes. In the end, NR_KERNEL_STACK_KB
as exported to userspace might be overestimated on some nodes and
underestimated on others. This is not a real-world problem, just one
found by reading the code, so there is no actual data showing how much
impact it has on users.
This doesn't pose any real problem for the correctness of kernel behavior,
since the counter is not used for any internal processing, but it can
cause some confusion to userspace.
Address the problem by accounting each vmalloc backing page to its own
node.
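A sketch of the per-page accounting for vmalloc-backed stacks, modeled on
account_kernel_stack(); treat the details as illustrative rather than as the
exact upstream diff:

        static void account_kernel_stack(struct task_struct *tsk, int account)
        {
                if (IS_ENABLED(CONFIG_VMAP_STACK)) {
                        struct vm_struct *vm = task_stack_vm_area(tsk);
                        int i;

                        /* Charge each backing page to the node it actually
                         * lives on, instead of lumping them onto one node. */
                        for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
                                mod_lruvec_page_state(vm->pages[i],
                                                      NR_KERNEL_STACK_KB,
                                                      account * (PAGE_SIZE / 1024));
                } else {
                        /* Slab-backed stacks are one physically contiguous
                         * allocation and stay on a single node. */
                }
        }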
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Replace /* */ comment with //, fix SPDX comment style.
see: Documentation/process/license-rules.rst
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zhiyuan Dai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Since commit 5a52c9df62b4 ("uprobe: use FOLL_SPLIT_PMD instead of
FOLL_SPLIT") and commit ba925fa35057 ("s390/gmap: improve THP splitting")
FOLL_SPLIT has not been used anymore. Remove the dead code.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Shi <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use the newly added unpin_user_page_range_dirty_lock() to more quickly
unpin a consecutive range of pages represented as compound pages. This
also calculates the number of pages to unpin (the tail pages that share a
head page) and thus batches the refcount update.
Running a test program which calls memory range reg/unreg on a region 1G
in size and measures cost of both operations together (in a guest using
rxe) with THP and hugetlbfs:
Before:
590 rounds in 5.003 sec: 8480.335 usec / round
6898 rounds in 60.001 sec: 8698.367 usec / round
After:
2688 rounds in 5.002 sec: 1860.786 usec / round
32517 rounds in 60.001 sec: 1845.225 usec / round
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Joao Martins <[email protected]>
Acked-by: Jason Gunthorpe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Doug Ledford <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add an unpin_user_page_range_dirty_lock() API which takes a starting page
and how many consecutive pages we want to unpin and optionally dirty.
To that end, define another iterator, for_each_compound_range(), that
operates on page ranges as opposed to a page array.
For users (like RDMA mr_dereg) where each sg represents a contiguous set
of pages, we're able to unpin pages more efficiently without having to
supply an array of pages, as largely happens today with
unpin_user_pages().
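As a hedged example of the intended caller pattern for such users (the umem
fields below are assumptions; unpin_user_page_range_dirty_lock() is the API
added here):

        struct scatterlist *sg;
        int i;

        /* Each sg entry covers one contiguous run of pinned pages. */
        for_each_sg(umem->sg_head.sgl, sg, umem->sg_nents, i)
                unpin_user_page_range_dirty_lock(sg_page(sg),
                                DIV_ROUND_UP(sg->length, PAGE_SIZE),
                                make_dirty);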
Link: https://lkml.kernel.org/r/[email protected]
Suggested-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Joao Martins <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Doug Ledford <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Rather than decrementing the head page refcount one by one, walk the page
array and check which pages belong to the same compound_head. Then
decrement the calculated number of references in a single write to the
head page. To that end, switch to for_each_compound_head(), which does
most of the work.
set_page_dirty() needs no adjustment as it's a nop for non-dirty head
pages and it doesn't operate on tail pages.
This considerably improves unpinning of pages with THP and hugetlbfs:
- THP
gup_test -t -m 16384 -r 10 [-L|-a] -S -n 512 -w
PIN_LONGTERM_BENCHMARK (put values): ~87.6k us -> ~23.2k us
- 16G with 1G huge page size
gup_test -f /mnt/huge/file -m 16384 -r 10 [-L|-a] -S -n 512 -w
PIN_LONGTERM_BENCHMARK: (put values): ~87.6k us -> ~27.5k us
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Joao Martins <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Doug Ledford <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm/gup: page unpining improvements", v4.
This series improves page unpinning, with an eye on improving MR
deregistration for big swaths of memory (which is bound by the page
unpinning), particularly:
1) Decrement the head page refcount by @ntails, greatly reducing the
number of atomic operations per compound page. This is done by
comparing the heads of individual tail pages, counting how many
consecutive tails share the same head, and updating the head page
refcount based on that. This should be a visible improvement for all
page (un)pinners that use compound pages.
2) Introduce a new API for unpinning page ranges (to avoid the trick in
the previous item and rely on arithmetic instead), and use it in RDMA
ib_mem_release (used for MR deregistration).
Performance improvements: unpin_user_pages() for hugetlbfs and THP
improves ~3x (through gup_test) and RDMA MR dereg improves ~4.5x with the
new API. See patches 2 and 4 for those.
This patch (of 4):
Add a helper that iterates over head pages in a list of pages. It
essentially counts the tails until the next page to process has a
different head than the current one. This is going to be used by the
unpin_user_pages() family of functions, to batch the head page refcount
updates once for all passed consecutive tail pages.
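A sketch of how the unpin path is expected to use the helper;
put_compound_head() and the macro's argument order follow this series and
should be read as illustrative:

        void unpin_user_pages(struct page **pages, unsigned long npages)
        {
                unsigned long i;
                struct page *head;
                unsigned int ntails;

                /* One refcount update per compound head instead of one per tail. */
                for_each_compound_head(i, pages, npages, head, ntails)
                        put_compound_head(head, ntails, FOLL_PIN);
        }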
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Joao Martins <[email protected]>
Suggested-by: Jason Gunthorpe <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Cc: Doug Ledford <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
If an unmapped region was found and the flag is MS_ASYNC (without
MS_INVALIDATE) there is nothing to do and the result would always be
-ENOMEM, so return immediately.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Nikita Ermakov <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit a6de4b4873e1 ("mm: convert find_get_entry to return the head page")
uses @index instead of @offset, but the comment is stale, update it.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Rui Sun <[email protected]>
Signed-off-by: Shaokun Zhang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
page_mapping_file() is only used by some architectures, and then it
is usually only used in one place. Make it a static inline function
so other architectures don't have to carry this dead code.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Mike Rapoport <[email protected]>
Cc: Huang Ying <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Page writeback doesn't hold a page reference, which allows truncate to
free a page the second PageWriteback is cleared. This used to require
special attention in test_clear_page_writeback(), where we had to be
careful not to rely on the unstable page->memcg binding and look up all
the necessary information before clearing the writeback flag.
Since commit 073861ed77b6 ("mm: fix VM_BUG_ON(PageTail) and
BUG_ON(PageWriteback)") test_clear_page_writeback() is called with an
explicit reference on the page, and this dance is no longer needed.
Use unlock_page_memcg() and dec_lruvec_page_state() directly.
This removes the last user of the lock_page_memcg() return value, so
change it to void. Touch up the comments in there as well. This also
removes the last extern user of __unlock_page_memcg(), so make it static.
Further, it removes the last user of dec_lruvec_state(); delete it, along
with a few other unused helpers.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
If the I/O completed successfully, the page will remain Uptodate, even
if it is subsequently truncated. If the I/O completed with an error,
this check would cause us to retry the I/O if the page were truncated
before we woke up. There is no need to retry the I/O; the I/O to fill
the page failed, so we can legitimately just return -EIO.
This code was originally added by commit 56f0d5fe6851 ("[PATCH]
readpage-vs-invalidate fix") in 2005 (this commit ID is from the
linux-fullhistory tree; it is also commit ba1f08f14b52 in tglx-history).
At the time, truncate_complete_page() called ClearPageUptodate(), and so
this was fixing a real bug. In 2008, commit 84209e02de48 ("mm: dont clear
PG_uptodate on truncate/invalidate") removed the call to
ClearPageUptodate, and this check has been unnecessary ever since.
It doesn't do any real harm, but there's no need to keep it.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: William Kucharski <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
After splitting generic_file_buffered_read() into smaller parts, it turns
out we can reuse one of the parts in filemap_fault(). This fixes an
oversight -- waiting for the I/O to complete is now interruptible by a
fatal signal. And it saves us a few bytes of text in an unlikely path.
$ ./scripts/bloat-o-meter before.o after.o
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-207 (-207)
Function old new delta
filemap_fault 2187 1980 -207
Total: Before=37491, After=37284, chg -0.55%
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
For reads, use the better variant of checking for the need to call
filemap_write_and_wait_range() when doing O_DIRECT. This avoids falling
back to the slow path for IOCB_NOWAIT, if there are no pages to wait for
(or write out).
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
For the generic page cache read helper, use the better variant of checking
for the need to call filemap_write_and_wait_range() when doing O_DIRECT
reads. This avoids falling back to the slow path for IOCB_NOWAIT, if
there are no pages to wait for (or write out).
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "Improve IOCB_NOWAIT O_DIRECT reads", v3.
An internal workload complained because it was using too much CPU, and
when I took a look, we had a lot of io_uring workers going to town.
For an async buffered read-like workload, I am normally expecting _zero_
offloads to a worker thread, but this one had tons of them. I'd drop
caches and things would look good again, but then a minute later we'd
regress back to using workers. Turns out that every minute something
was reading parts of the device, which would add page cache for that
inode. I put patches like these in for our kernel, and the problem was
solved.
Don't -EAGAIN IOCB_NOWAIT dio reads just because we have page cache
entries for the given range. This causes unnecessary work from the
caller's side, when the IO could have been issued totally fine without
blocking on writeback when there is none.
This patch (of 3):
For O_DIRECT reads/writes, we check if we need to issue a call to
filemap_write_and_wait_range() to issue and/or wait for writeback for any
page in the given range. The existing mechanism just checks for a page in
the range, which is suboptimal for IOCB_NOWAIT as we'll fall back to the
slow path (and need to retry) if there's just a clean page cache page in
the range.
Provide filemap_range_needs_writeback() which tries a little harder to
check if we actually need to issue and/or wait for writeback in the range.
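A condensed sketch of how the new check slots into the O_DIRECT read path
(see the actual filemap.c change for the precise form):

        if (iocb->ki_flags & IOCB_NOWAIT) {
                /* Only clean, non-writeback pages in range: no need to
                 * punt to the slow path. */
                if (filemap_range_needs_writeback(mapping, iocb->ki_pos,
                                                  iocb->ki_pos + count - 1))
                        return -EAGAIN;
        } else {
                retval = filemap_write_and_wait_range(mapping, iocb->ki_pos,
                                                      iocb->ki_pos + count - 1);
                if (retval < 0)
                        return retval;
        }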
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
early_memtest() does not get called from all architectures. Hence
enabling CONFIG_MEMTEST and providing a valid memtest=[1..N] kernel
command line option might not trigger the memory pattern tests as would be
expected in normal circumstances. This situation is misleading.
The change here prevents the above-mentioned problem by introducing a new
config option, ARCH_USE_MEMTEST, which should be selected by platforms
that call early_memtest(), in order to make CONFIG_MEMTEST available.
Conversely, CONFIG_MEMTEST cannot be enabled on platforms where it would
not be exercised anyway.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Acked-by: Catalin Marinas <[email protected]> (arm64)
Reviewed-by: Max Filippov <[email protected]>
Cc: Russell King <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Chris Zankel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When page_poison detects page corruption it's useful to see who freed a
page recently, to get an idea of where write-after-free corruption
happens. After this change the corruption report carries extra page_owner
data.
Example report from a real corruption (includes only the page_owner part):
pagealloc: memory corruption
e00000014cd61d10: 11 00 00 00 00 00 00 00 30 1d d2 ff ff 0f 00 60 ........0......`
e00000014cd61d20: b0 1d d2 ff ff 0f 00 60 90 fe 1c 00 08 00 00 20 .......`.......
...
CPU: 1 PID: 220402 Comm: cc1plus Not tainted 5.12.0-rc5-00107-g9720c6f59ecf #245
Hardware name: hp server rx3600, BIOS 04.03 04/08/2008
...
Call Trace:
[<a000000100015210>] show_stack+0x90/0xc0
[<a000000101163390>] dump_stack+0x150/0x1c0
[<a0000001003f1e90>] __kernel_unpoison_pages+0x410/0x440
[<a0000001003c2460>] get_page_from_freelist+0x1460/0x2ca0
[<a0000001003c6be0>] __alloc_pages_nodemask+0x3c0/0x660
[<a0000001003ed690>] alloc_pages_vma+0xb0/0x500
[<a00000010037deb0>] __handle_mm_fault+0x1230/0x1fe0
[<a00000010037ef70>] handle_mm_fault+0x310/0x4e0
[<a00000010005dc70>] ia64_do_page_fault+0x1f0/0xb80
[<a00000010000ca00>] ia64_leave_kernel+0x0/0x270
page_owner tracks the page as freed
page allocated via order 0, migratetype Movable,
gfp_mask 0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), pid 37, ts 8173444098740
__reset_page_owner+0x40/0x200
free_pcp_prepare+0x4d0/0x600
free_unref_page+0x20/0x1c0
__put_page+0x110/0x1a0
migrate_pages+0x16d0/0x1dc0
compact_zone+0xfc0/0x1aa0
proactive_compact_node+0xd0/0x1e0
kcompactd+0x550/0x600
kthread+0x2c0/0x2e0
call_payload+0x50/0x80
Here we can see that page was freed by page migration but something
managed to write to it afterwards.
[[email protected]: s/dump_page_owner/dump_page/, per Vlastimil]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sergei Trofimovich <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Before the change, page_owner recursion was detected by fetching a
backtrace and inspecting it for the current instruction pointer.
It has a few problems:
- it is slightly slow, as it requires an extra backtrace and a linear
stack scan of the result
- it is too late to check whether fetching the backtrace itself required
a memory allocation (ia64's unwinder requires it).
To simplify recursion tracking, let's use a page_owner recursion flag in
'struct task_struct'.
The change makes page_owner=on work on ia64 by avoiding infinite
recursion in:
kmalloc()
-> __set_page_owner()
-> save_stack()
-> unwind() [ia64-specific]
-> build_script()
-> kmalloc()
-> __set_page_owner() [we short-circuit here]
-> save_stack()
-> unwind() [recursion]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sergei Trofimovich <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
I tried to use page_owner=1 for a while and noticed too late that it had
no effect, as opposed to the similar init_on_alloc=1 (which does work).
Let's make them consistent.
The change decreases the binary size slightly:
text data bss dec hex filename
12408 321 17 12746 31ca mm/page_owner.o.before
12320 321 17 12658 3172 mm/page_owner.o.after
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sergei Trofimovich <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Very minor optimization.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sergei Trofimovich <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Since commit 5556cfe8d994 ("mm, page_owner: fix off-by-one error in
__set_page_owner_handle()"), the parameter 'page' is no longer used, so
remove it.
Link: https://lkml.kernel.org/r/1616602022-43545-1-git-send-email-zhongjiang-ali@linux.alibaba.com
Signed-off-by: zhongjiang-ali <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Collect the time when each allocation is freed, to help with memory
analysis with kdump/ramdump. Add the timestamp also in the page_owner
debugfs file and print it in dump_page().
Having another timestamp for when the page was freed helps with debugging
page migration issues. For example, both alloc and free timestamps being
the same can give a hint that there is an issue with migrating memory, as
opposed to a page just being dropped during migration.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Georgi Djakov <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
s/interruptable/interruptible/
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Bhaskar Chowdhury <[email protected]>
Acked-by: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
s/operatios/operations/
s/Mininum/Minimum/
s/mininum/minimum/ ......two different places.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Bhaskar Chowdhury <[email protected]>
Acked-by: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit ca0cab65ea2b ("mm, slub: introduce static key for slub_debug()")
introduced a static key to optimize the case where no debugging is
enabled for any cache. The static key is enabled when slub_debug boot
parameter is passed, or CONFIG_SLUB_DEBUG_ON enabled.
However, some caches might be created with one or more debugging flags
explicitly passed to kmem_cache_create(), and the commit missed this.
Thus the debugging functionality would not actually be performed for
these caches unless the static key gets enabled by a boot param or config.
This patch fixes it by checking for debugging flags passed to
kmem_cache_create() and enabling the static key accordingly.
Note such explicit debugging flags should not be used outside of
debugging and testing as they will now enable the static key globally.
btrfs_init_cachep() creates a cache with SLAB_RED_ZONE but that's a
mistake that's being corrected [1]. rcu_torture_stats() creates a cache
with SLAB_STORE_USER, but that is a testing module so it's OK and will
start working as intended after this patch.
Also note that in case of backports to kernels before v5.12 that don't
have 59450bbc12be ("mm, slab, slub: stop taking cpu hotplug lock"),
static_branch_enable_cpuslocked() should be used.
[1] https://lore.kernel.org/linux-btrfs/[email protected]/
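A minimal sketch of the fix's shape, assuming the SLAB_DEBUG_FLAGS mask and
the slub_debug_enabled static key introduced by the earlier commit (the exact
hook point on the kmem_cache_create() path may differ):

        #ifdef CONFIG_SLUB_DEBUG
                /* The caller asked for red-zoning, poisoning or user tracking:
                 * make sure the fast-path static key is enabled. */
                if (flags & SLAB_DEBUG_FLAGS)
                        static_branch_enable(&slub_debug_enabled);
        #endif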
Link: https://lkml.kernel.org/r/[email protected]
Fixes: ca0cab65ea2b ("mm, slub: introduce static key for slub_debug()")
Signed-off-by: Vlastimil Babka <[email protected]>
Reported-by: Oliver Glitta <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This is a minor addition to the allocator setup options, providing a
simple way to re-enable cache merging on demand for builds that by
default run with CONFIG_SLAB_MERGE_DEFAULT not set.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Rafael Aquini <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit d6ad3e286d2c ("softlockup: Add sched_clock_tick() to avoid kernel
warning on kgdb resume") introduced touch_softlockup_watchdog_sync().
It solved a problem when the watchdog was touched in an atomic context,
the timer callback was proceed right after releasing interrupts, and the
local clock has not been updated yet. In this case, sched_clock_tick()
was called in watchdog_timer_fn() before updating the timer.
So far so good.
Later commit 5d1c0f4a80a6 ("watchdog: add check for suspended vm in
softlockup detector") added two kvm_check_and_clear_guest_paused()
calls. They touch the watchdog when the guest has been sleeping.
The code makes my head spin around.
Scenario 1:
+ guest did sleep:
+ PVCLOCK_GUEST_STOPPED is set
+ 1st watchdog_timer_fn() invocation:
+ the watchdog is not touched yet
+ is_softlockup() returns too big delay
+ kvm_check_and_clear_guest_paused():
+ clear PVCLOCK_GUEST_STOPPED
+ call touch_softlockup_watchdog_sync()
+ set SOFTLOCKUP_DELAY_REPORT
+ set softlockup_touch_sync
+ return from the timer callback
+ 2nd watchdog_timer_fn() invocation:
+ call sched_clock_tick() even though it is not needed.
The timer callback was invoked again only because the clock
has already been updated in the meantime.
+ call kvm_check_and_clear_guest_paused() that does nothing
because PVCLOCK_GUEST_STOPPED has been cleared already.
+ call update_report_ts() and return. This is fine. Except
that sched_clock_tick() might allow to set it already
during the 1st invocation.
Scenario 2:
+ guest did sleep
+ 1st watchdog_timer_fn() invocation
+ same as in 1st scenario
+ guest did sleep again:
+ set PVCLOCK_GUEST_STOPPED again
+ 2nd watchdog_timer_fn() invocation
+ SOFTLOCKUP_DELAY_REPORT is set from 1st invocation
+ call sched_clock_tick()
+ call kvm_check_and_clear_guest_paused()
+ clear PVCLOCK_GUEST_STOPPED
+ call touch_softlockup_watchdog_sync()
+ set SOFTLOCKUP_DELAY_REPORT
+ set softlockup_touch_sync
+ call update_report_ts() (set real timestamp immediately)
+ return from the timer callback
+ 3rd watchdog_timer_fn() invocation
+ timestamp is set from 2nd invocation
+ softlockup_touch_sync is set but not checked because
the real timestamp is already set
Make the code more straightforward:
1. Always call kvm_check_and_clear_guest_paused() at the very
beginning to handle PVCLOCK_GUEST_STOPPED. It touches the watchdog
when the guest did sleep.
2. Handle the situation when the watchdog has been touched
(SOFTLOCKUP_DELAY_REPORT is set).
Call sched_clock_tick() when a touch_*sync() variant was used. It makes
sure that the timestamp will be up to date even when it has been touched
in atomic context or the guest did sleep.
As a result, kvm_check_and_clear_guest_paused() is called on a single
location. And the right timestamp is always set when returning from the
timer callback.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Petr Mladek <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Laurence Oberman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vincent Whitchurch <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Any parallel softlockup reports are skipped when one CPU is already
printing backtraces from all CPUs.
The exclusive rights are synchronized using one bit in
soft_lockup_nmi_warn. There is also one memory barrier that does not make
much sense.
Use two barriers on the right location to prevent mixing two reports.
[[email protected]: use bit lock operations to prevent multiple soft-lockup reports]
Link: https://lkml.kernel.org/r/YFSVsLGVWMXTvlbk@alley
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Petr Mladek <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Laurence Oberman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vincent Whitchurch <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The softlockup detector does some gymnastics with the variable
soft_watchdog_warn. It was added by commit 58687acba59266735ad
("lockup_detector: Combine nmi_watchdog and softlockup detector").
The purpose is not completely clear. There are the following clues. They
describe the situation as it looked after the above mentioned commit:
1. The variable was checked with a comment "only warn once".
2. The variable was set when softlockup was reported. It was cleared
only when the CPU was no longer in the softlockup state.
3. watchdog_touch_ts was not explicitly updated when the softlockup
was reported. Without this variable, the report would normally
be printed again during every following watchdog_timer_fn()
invocation.
The logic has got even more tangled up by the commit ed235875e2ca98
("kernel/watchdog.c: print traces for all cpus on lockup detection").
After this commit, soft_watchdog_warn is set only when
softlockup_all_cpu_backtrace is enabled. But multiple reports from all
CPUs are prevented by a new variable soft_lockup_nmi_warn.
Conclusion:
The variable probably never worked as intended. In any case, it has not
worked for many years, because the softlockup was reported repeatedly
after the full period defined by watchdog_thresh.
The reason is that watchdog gets touched in many known slow paths, for
example, in printk_stack_address(). This code is called also when
printing the softlockup report. It means that the watchdog timestamp gets
updated after each report.
Solution:
Simply remove the logic. People want the periodic report anyway.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Petr Mladek <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Laurence Oberman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vincent Whitchurch <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The softlockup detector currently shows the time spent since the last
report. As a result it is not clear whether a CPU is infinitely hogged by
a single task or if it is a repeated event.
The situation can be simulated with a simple busy loop:
while (true)
cpu_relax();
The softlockup detector produces:
[ 168.277520] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [cat:4865]
[ 196.277604] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [cat:4865]
[ 236.277522] watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [cat:4865]
But it should be something like:
[ 480.372418] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [cat:4943]
[ 508.372359] watchdog: BUG: soft lockup - CPU#2 stuck for 52s! [cat:4943]
[ 548.372359] watchdog: BUG: soft lockup - CPU#2 stuck for 89s! [cat:4943]
[ 576.372351] watchdog: BUG: soft lockup - CPU#2 stuck for 115s! [cat:4943]
For better output, add an additional timestamp for the last report. Only
this timestamp is reset when the watchdog is intentionally touched from
slow code paths or when printing the report.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Petr Mladek <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Laurence Oberman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vincent Whitchurch <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The softlockup situation might stay for a long time or even forever. When
it happens, the softlockup debug messages are printed in regular intervals
defined by get_softlockup_thresh().
There is a mystery. The repeated message is printed after the full
interval that is defined by get_softlockup_thresh(). But the timer
callback is called more often as defined by sample_period. The code looks
like the soflockup should get reported in every sample_period when it was
once behind the thresh.
It works only by chance. The watchdog is touched when printing the stall
report, for example, in printk_stack_address().
Make the behavior clear and predictable by explicitly updating the
timestamp in watchdog_timer_fn() when the report gets printed.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Petr Mladek <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Laurence Oberman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vincent Whitchurch <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "watchdog/softlockup: Report overall time and some cleanup", v2.
I dug deep into the softlockup watchdog history when time permitted this
year, and reworked the patchset that fixed timestamps and cleaned up the
code [2].
I split it into very small steps and did even more code cleanup. The
result looks quite straightforward and I am pretty confident in the
changes.
[1] v2: https://lore.kernel.org/r/[email protected]
[2] v1: https://lore.kernel.org/r/[email protected]
This patch (of 6):
There are many touch_*watchdog() functions. They are called in situations
where the watchdog could report false positives or create unnecessary
noise. For example, when a CPU is entering idle mode, a virtual machine
is stopped, or a lot of messages are printed in atomic context.
These functions set SOFTLOCKUP_RESET instead of a real timestamp. This
allows calling them even in contexts where jiffies might be outdated, for
example in atomic context.
The real timestamp is set by __touch_watchdog(), which is called from the
watchdog timer callback.
Rename this callback to update_touch_ts(). It better describes the effect
and clearly distinguishes it from the other touch_*watchdog() functions.
Another motivation is that two timestamps are going to be used. One will
be used for the total softlockup time. The other will be used to measure
time since the last report. The new function name will help to
distinguish which timestamp is being updated.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Petr Mladek <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Laurence Oberman <[email protected]>
Cc: Vincent Whitchurch <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Fix kernel-doc notation function arguments to eliminate two kernel-doc
warnings:
fs_parser.c:322: warning: Excess function parameter 'name' description in 'validate_constant_table'
fs_parser.c:367: warning: Function parameter or member 'name' not described in 'fs_validate_description'
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Randy Dunlap <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: David Howells <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The intent with this code was to return negative error codes but instead
it returns positives.
The problem is how type promotion works with ternary operations. These
functions return long, "ret" is an int and "copied" is a u32. The
negative error code is first cast to u32 so it becomes a high positive and
then cast to long where it's still a positive.
We could fix this by declaring "ret" as a ssize_t but let's just get rid
of the ternaries instead.
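A standalone illustration of the promotion at fault (plain userspace C on a
64-bit build, not the sample code itself):

        #include <stdio.h>

        int main(void)
        {
                int ret = -14;                  /* e.g. -EFAULT */
                unsigned int copied = 0;        /* mirrors the u32 in the sample */

                /* "ret ? ret : copied" is evaluated as unsigned int: the
                 * negative error code wraps to a large positive value before
                 * the result is converted to long. */
                long val = ret ? ret : copied;

                printf("%ld\n", val);           /* prints 4294967282, not -14 */
                return 0;
        }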
Link: https://lkml.kernel.org/r/YIE+/cK1tBzSuQPU@mwanda
Fixes: 5bf2b19320ec ("kfifo: add example files to the kernel sample directory")
Signed-off-by: Dan Carpenter <[email protected]>
Cc: Stefani Seibold <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Fix the following clang warning:
fs/ocfs2/dlm/dlmrecovery.c:129:20: warning: unused function 'dlm_reset_recovery' [-Wunused-function].
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jiapeng Chong <[email protected]>
Reported-by: Abaci Robot <[email protected]>
Reviewed-by: Wengang Wang <[email protected]>
Acked-by: Joseph Qi <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Gang He <[email protected]>
Cc: Jun Piao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
s/cluter/cluster/
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Bhaskar Chowdhury <[email protected]>
Acked-by: Joseph Qi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Using the macro map_flag() is tricky and coccicheck outputs the following
warning:
fs/ocfs2/stack_o2cb.c:69:5-16: Unneeded variable: "o2dlm_flags"
So map the flags directly in flags_to_o2dlm() to make coccicheck happy.
And remove the BUG_ON() here as well to simplify the code, since it has
run well for a long time.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Joseph Qi <[email protected]>
Reviewed-by: Wengang Wang <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Gang He <[email protected]>
Cc: Jun Piao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Fix the following coccicheck warning:
fs/ocfs2/blockcheck.c:232:0-23: WARNING: blockcheck_fops should be defined with DEFINE_DEBUGFS_ATTRIBUTE
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Li <[email protected]>
Reported-by: Abaci Robot <[email protected]>
Acked-by: Joseph Qi <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Gang He <[email protected]>
Cc: Jun Piao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The include of 'asm-generic/tlb.h' in 'asm/tlb.h' is duplicated; remove
the duplicate.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zhang Yunkai <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Zhang Yunkai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
checkdeclares: find structs declared more than once. Inspired by
checkincludes.pl.
This script checks for duplicate struct declarations. Note that this will not
take into consideration macros, so you should run this only if you know
you do have real dups and do not have them under #ifdef's. You could also
just review the results.
[[email protected]: fix usage message, grammar]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Wan Jiabing <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add a few entries for recent spelling fixes found.
Opportunistically de-dupe:
exeeds||exceeds
Link: https://lore.kernel.org/lkml/31acb3239b7ab8989db0c9951e8740050aef0205.1616727528.git.tom.saeger@oracle.com/
Link: https://lore.kernel.org/lkml/fa193b3c9e346ff3fc157b54802c29b25f79c402.1615597995.git.tom.saeger@oracle.com/
Link: https://lkml.kernel.org/r/4a594a9e1536b1d9e5ba57f684c1e41457dd383b.1616861645.git.tom.saeger@oracle.com
Signed-off-by: Tom Saeger <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Colin Ian King <[email protected]>
Cc: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|