aboutsummaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)AuthorFilesLines
2021-09-03mm/migrate: add sysfs interface to enable reclaim migrationHuang Ying1-0/+4
Some method is obviously needed to enable reclaim-based migration. Just like traditional autonuma, there will be some workloads that will benefit like workloads with more "static" configurations where hot pages stay hot and cold pages stay cold. If pages come and go from the hot and cold sets, the benefits of this approach will be more limited. The benefits are truly workload-based and *not* hardware-based. We do not believe that there is a viable threshold where certain hardware configurations should have this mechanism enabled while others do not. To be conservative, earlier work defaulted to disable reclaim- based migration and did not include a mechanism to enable it. This proposes add a new sysfs file /sys/kernel/mm/numa/demotion_enabled as a method to enable it. We are open to any alternative that allows end users to enable this mechanism or disable it if workload harm is detected (just like traditional autonuma). Once this is enabled page demotion may move data to a NUMA node that does not fall into the cpuset of the allocating process. This could be construed to violate the guarantees of cpusets. However, since this is an opt-in mechanism, the assumption is that anyone enabling it is content to relax the guarantees. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Huang Ying <[email protected]> Originally-by: Dave Hansen <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Wei Xu <[email protected]> Cc: Yang Shi <[email protected]> Cc: Zi Yan <[email protected]> Cc: David Rientjes <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Keith Busch <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03mm/vmscan: add page demotion counterYang Shi1-0/+2
Account the number of demoted pages. Add pgdemote_kswapd and pgdemote_direct VM counters showed in /proc/vmstat. [ daveh: - __count_vm_events() a bit, and made them look at the THP size directly rather than getting data from migrate_pages() ] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Signed-off-by: "Huang, Ying" <[email protected]> Reviewed-by: Yang Shi <[email protected]> Reviewed-by: Wei Xu <[email protected]> Reviewed-by: Zi Yan <[email protected]> Cc: Michal Hocko <[email protected]> Cc: David Rientjes <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Keith Busch <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03mm/migrate: demote pages during reclaimDave Hansen1-0/+9
This is mostly derived from a patch from Yang Shi: https://lore.kernel.org/linux-mm/[email protected]/ Add code to the reclaim path (shrink_page_list()) to "demote" data to another NUMA node instead of discarding the data. This always avoids the cost of I/O needed to read the page back in and sometimes avoids the writeout cost when the page is dirty. A second pass through shrink_page_list() will be made if any demotions fail. This essentially falls back to normal reclaim behavior in the case that demotions fail. Previous versions of this patch may have simply failed to reclaim pages which were eligible for demotion but were unable to be demoted in practice. For some cases, for example, MADV_PAGEOUT, the pages are always discarded instead of demoted to follow the kernel API definition. Because MADV_PAGEOUT is defined as freeing specified pages regardless in which tier they are. Note: This just adds the start of infrastructure for migration. It is actually disabled next to the FIXME in migrate_demote_page_ok(). [[email protected]: v11] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Dave Hansen <[email protected]> Signed-off-by: "Huang, Ying" <[email protected]> Reviewed-by: Yang Shi <[email protected]> Reviewed-by: Wei Xu <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Reviewed-by: Zi Yan <[email protected]> Cc: Michal Hocko <[email protected]> Cc: David Rientjes <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Keith Busch <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03mm/migrate: enable returning precise migrate_pages() success countYang Shi1-2/+3
Under normal circumstances, migrate_pages() returns the number of pages migrated. In error conditions, it returns an error code. When returning an error code, there is no way to know how many pages were migrated or not migrated. Make migrate_pages() return how many pages are demoted successfully for all cases, including when encountering errors. Page reclaim behavior will depend on this in subsequent patches. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Signed-off-by: "Huang, Ying" <[email protected]> Suggested-by: Oscar Salvador <[email protected]> [optional parameter] Reviewed-by: Yang Shi <[email protected]> Reviewed-by: Zi Yan <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Wei Xu <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Rientjes <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Keith Busch <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03userfaultfd: change mmap_changing to atomicNadav Amit1-4/+4
Patch series "userfaultfd: minor bug fixes". Three unrelated bug fixes. The first two addresses possible issues (not too theoretical ones), but I did not encounter them in practice. The third patch addresses a test bug that causes the test to fail on my system. It has been sent before as part of a bigger RFC. This patch (of 3): mmap_changing is currently a boolean variable, which is set and cleared without any lock that protects against concurrent modifications. mmap_changing is supposed to mark whether userfaultfd page-faults handling should be retried since mappings are undergoing a change. However, concurrent calls, for instance to madvise(MADV_DONTNEED), might cause mmap_changing to be false, although the remove event was still not read (hence acknowledged) by the user. Change mmap_changing to atomic_t and increase/decrease appropriately. Add a debug assertion to see whether mmap_changing is negative. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: df2cc96e77011 ("userfaultfd: prevent non-cooperative events vs mcopy_atomic races") Signed-off-by: Nadav Amit <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Peter Xu <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03hugetlb: fix hugetlb cgroup refcounting during vma splitMike Kravetz1-0/+12
Guillaume Morin reported hitting the following WARNING followed by GPF or NULL pointer deference either in cgroups_destroy or in the kill_css path.: percpu ref (css_release) <= 0 (-1) after switching to atomic WARNING: CPU: 23 PID: 130 at lib/percpu-refcount.c:196 percpu_ref_switch_to_atomic_rcu+0x127/0x130 CPU: 23 PID: 130 Comm: ksoftirqd/23 Kdump: loaded Tainted: G O 5.10.60 #1 RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x127/0x130 Call Trace: rcu_core+0x30f/0x530 rcu_core_si+0xe/0x10 __do_softirq+0x103/0x2a2 run_ksoftirqd+0x2b/0x40 smpboot_thread_fn+0x11a/0x170 kthread+0x10a/0x140 ret_from_fork+0x22/0x30 Upon further examination, it was discovered that the css structure was associated with hugetlb reservations. For private hugetlb mappings the vma points to a reserve map that contains a pointer to the css. At mmap time, reservations are set up and a reference to the css is taken. This reference is dropped in the vma close operation; hugetlb_vm_op_close. However, if a vma is split no additional reference to the css is taken yet hugetlb_vm_op_close will be called twice for the split vma resulting in an underflow. Fix by taking another reference in hugetlb_vm_op_open. Note that the reference is only taken for the owner of the reserve map. In the more common fork case, the pointer to the reserve map is cleared for non-owning vmas. Link: https://lkml.kernel.org/r/[email protected] Fixes: e9fe92ae0cd2 ("hugetlb_cgroup: add reservation accounting for private mappings") Signed-off-by: Mike Kravetz <[email protected]> Reported-by: Guillaume Morin <[email protected]> Suggested-by: Guillaume Morin <[email protected]> Tested-by: Guillaume Morin <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03mm: hwpoison: don't drop slab caches for offlining non-LRU pageYang Shi1-1/+1
In the current implementation of soft offline, if non-LRU page is met, all the slab caches will be dropped to free the page then offline. But if the page is not slab page all the effort is wasted in vain. Even though it is a slab page, it is not guaranteed the page could be freed at all. However the side effect and cost is quite high. It does not only drop the slab caches, but also may drop a significant amount of page caches which are associated with inode caches. It could make the most workingset gone in order to just offline a page. And the offline is not guaranteed to succeed at all, actually I really doubt the success rate for real life workload. Furthermore the worse consequence is the system may be locked up and unusable since the page cache release may incur huge amount of works queued for memcg release. Actually we ran into such unpleasant case in our production environment. Firstly, the workqueue of memory_failure_work_func is locked up as below: BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 53s! Showing busy workqueues and worker pools: workqueue events: flags=0x0 pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=14/256 refcnt=15 in-flight: 409271:memory_failure_work_func pending: kfree_rcu_work, kfree_rcu_monitor, kfree_rcu_work, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, drain_local_stock, kfree_rcu_work workqueue mm_percpu_wq: flags=0x8 pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2 pending: vmstat_update workqueue cgroup_destroy: flags=0x0 pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1 refcnt=12072 pending: css_release_work_fn There were over 12K css_release_work_fn queued, and this caused a few lockups due to the contention of worker pool lock with IRQ disabled, for example: NMI watchdog: Watchdog detected hard LOCKUP on cpu 1 Modules linked in: amd64_edac_mod edac_mce_amd crct10dif_pclmul crc32_pclmul ghash_clmulni_intel xt_DSCP iptable_mangle kvm_amd bpfilter vfat fat acpi_ipmi i2c_piix4 usb_storage ipmi_si k10temp i2c_core ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel xfs libcrc32c crc32c_intel mlx5_core mlxfw nvme xhci_pci ptp nvme_core pps_core xhci_hcd CPU: 1 PID: 205500 Comm: kworker/1:0 Tainted: G L 5.10.32-t1.el7.twitter.x86_64 #1 Hardware name: TYAN F5AMT /z /S8026GM2NRE-CGN, BIOS V8.030 03/30/2021 Workqueue: events memory_failure_work_func RIP: 0010:queued_spin_lock_slowpath+0x41/0x1a0 Code: 41 f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 01 75 04 c6 47 RSP: 0018:ffff9b2ac278f900 EFLAGS: 00000002 RAX: 0000000000480101 RBX: ffff8ce98ce71800 RCX: 0000000000000084 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8ce98ce6a140 RBP: 00000000000284c8 R08: ffffd7248dcb6808 R09: 0000000000000000 R10: 0000000000000003 R11: ffff9b2ac278f9b0 R12: 0000000000000001 R13: ffff8cb44dab9c00 R14: ffffffffbd1ce6a0 R15: ffff8cacaa37f068 FS: 0000000000000000(0000) GS:ffff8ce98ce40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fcf6e8cb000 CR3: 0000000a0c60a000 CR4: 0000000000350ee0 Call Trace: __queue_work+0xd6/0x3c0 queue_work_on+0x1c/0x30 uncharge_batch+0x10e/0x110 mem_cgroup_uncharge_list+0x6d/0x80 release_pages+0x37f/0x3f0 __pagevec_release+0x1c/0x50 __invalidate_mapping_pages+0x348/0x380 inode_lru_isolate+0x10a/0x160 __list_lru_walk_one+0x7b/0x170 list_lru_walk_one+0x4a/0x60 prune_icache_sb+0x37/0x50 super_cache_scan+0x123/0x1a0 do_shrink_slab+0x10c/0x2c0 shrink_slab+0x1f1/0x290 drop_slab_node+0x4d/0x70 soft_offline_page+0x1ac/0x5b0 memory_failure_work_func+0x6a/0x90 process_one_work+0x19e/0x340 worker_thread+0x30/0x360 kthread+0x116/0x130 The lockup made the machine is quite unusable. And it also made the most workingset gone, the reclaimabled slab caches were reduced from 12G to 300MB, the page caches were decreased from 17G to 4G. But the most disappointing thing is all the effort doesn't make the page offline, it just returns: soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 () It seems the aggressive behavior for non-LRU page didn't pay back, so it doesn't make too much sense to keep it considering the terrible side effect. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Reported-by: David Mackey <[email protected]> Acked-by: David Hildenbrand <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03mm/sparse: set SECTION_NID_SHIFT to 6Naoya Horiguchi1-1/+1
Currently SECTION_NID_SHIFT is set to 3, which is incorrect because bit 3 and 4 can be overlapped by sub-field for early NID, and can be unexpectedly set on NUMA systems. There are a few non-critical issues related to this: - Having SECTION_TAINT_ZONE_DEVICE set for wrong sections forces pfn_to_online_page() through the slow path, but doesn't actually break the kernel. - A kdump generation tool like makedumpfile uses this field to calculate the physical address to read. So wrong bits can make the tool access to wrong address and fail to create kdump. This can be avoided by the tool, so it's not critical. To fix it, set SECTION_NID_SHIFT to 6 which is the minimum number of available bits of section flag field. Link: https://lkml.kernel.org/r/[email protected] Fixes: 1f90a3477df3 ("mm: teach pfn_to_online_page() about ZONE_DEVICE section collisions") Signed-off-by: Naoya Horiguchi <[email protected]> Reported-by: Kazuhito Hagio <[email protected]> Suggested-by: Dan Williams <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Wang Wensheng <[email protected]> Cc: Rui Xiang <[email protected]> Cc: Kazu <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03mm: sparse: remove __section_nr() functionOhhoon Kwon1-1/+0
As the last users of __section_nr() are gone, let's remove unused function __section_nr(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ohhoon Kwon <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Mike Rapoport <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Baoquan He <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03mm: sparse: pass section_nr to find_memory_blockOhhoon Kwon1-1/+1
With CONFIG_SPARSEMEM_EXTREME enabled, __section_nr() which converts mem_section to section_nr could be costly since it iterates all section roots to check if the given mem_section is in its range. On the other hand, __nr_to_section() which converts section_nr to mem_section can be done in O(1). Let's pass section_nr instead of mem_section ptr to find_memory_block() in order to reduce needless iterations. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ohhoon Kwon <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Mike Rapoport <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Baoquan He <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03mm: change fault_in_pages_* to have an unsigned size parameterGreg Kroah-Hartman1-2/+2
fault_in_pages_writeable() and fault_in_pages_readable() treat the size parameter as unsigned, doing pointer math with the value, so make this explicit and set it to be a size_t type which all callers currently treat it as anyway. This solves the issue where static checkers get nervous seeing pointer arithmetic happening with a signed value. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]> Reported-by: Jordy Zomer <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: David Howells <[email protected]> Cc: William Kucharski <[email protected]> Cc: "Darrick J. Wong" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Mauro Carvalho Chehab <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03mm: remove flush_kernel_dcache_pageChristoph Hellwig1-4/+1
flush_kernel_dcache_page is a rather confusing interface that implements a subset of flush_dcache_page by not being able to properly handle page cache mapped pages. The only callers left are in the exec code as all other previous callers were incorrect as they could have dealt with page cache pages. Replace the calls to flush_kernel_dcache_page with calls to flush_dcache_page, which for all architectures does either exactly the same thing, can contains one or more of the following: 1) an optimization to defer the cache flush for page cache pages not mapped into userspace 2) additional flushing for mapped page cache pages if cache aliases are possible Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Linus Torvalds <[email protected]> Reviewed-by: Ira Weiny <[email protected]> Cc: Alex Shi <[email protected]> Cc: Geoff Levand <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Guo Ren <[email protected]> Cc: Helge Deller <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Nick Hu <[email protected]> Cc: Paul Cercueil <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Ulf Hansson <[email protected]> Cc: Vincent Chen <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03mm, memcg: remove unused functionsMiaohe Lin1-12/+0
Since commit 2d146aa3aa84 ("mm: memcontrol: switch to rstat"), last user of memcg_stat_item_in_bytes() is gone. And since commit fa40d1ee9f15 ("mm: vmscan: memcontrol: remove mem_cgroup_select_victim_node()"), only the declaration of mem_cgroup_select_victim_node() is remained here. Remove them. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Reviewed-by: Muchun Song <[email protected]> Acked-by: Roman Gushchin <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Alex Shi <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03memcg: replace in_interrupt() by !in_task() in active_memcg()Vasily Averin1-1/+1
set_active_memcg() uses in_interrupt() check to select proper storage for cgroup: pointer on task struct or per-cpu pointer. It isn't fully correct: obsoleted in_interrupt() includes tasks with disabled BH. It's better to use '!in_task()' instead. Link: https://lkml.org/lkml/2021/7/26/487 Link: https://lkml.kernel.org/r/[email protected] Fixes: 37d5985c003d ("mm: kmem: prepare remote memcg charging infra for interrupt contexts") Signed-off-by: Vasily Averin <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Roman Gushchin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03memcg: cleanup racy sum avoidance codeShakeel Butt1-13/+2
We used to have per-cpu memcg and lruvec stats and the readers have to traverse and sum the stats from each cpu. This summing was racy and may expose transient negative values. So, an explicit check was added to avoid such scenarios. Now these stats are moved to rstat infrastructure and are no more per-cpu, so we can remove the fixup for transient negative values. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shakeel Butt <[email protected]> Acked-by: Roman Gushchin <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03memcg: infrastructure to flush memcg statsShakeel Butt1-0/+6
At the moment memcg stats are read in four contexts: 1. memcg stat user interfaces 2. dirty throttling 3. page fault 4. memory reclaim Currently the kernel flushes the stats for first two cases. Flushing the stats for remaining two casese may have performance impact. Always flushing the memcg stats on the page fault code path may negatively impacts the performance of the applications. In addition flushing in the memory reclaim code path, though treated as slowpath, can become the source of contention for the global lock taken for stat flushing because when system or memcg is under memory pressure, many tasks may enter the reclaim path. This patch uses following mechanisms to solve these challenges: 1. Periodically flush the stats from root memcg every 2 seconds. This will time limit the out of sync stats. 2. Asynchronously flush the stats after fixed number of stat updates. In the worst case the stat can be out of sync by O(nr_cpus * BATCH) for 2 seconds. 3. For avoiding thundering herd to flush the stats particularly from the memory reclaim context, introduce memcg local spinlock and let only one flusher active at a time. This could have been done through cgroup_rstat_lock lock but that lock is used by other subsystem and for userspace reading memcg stats. So, it is better to keep flushers introduced by this patch decoupled from cgroup_rstat_lock. However we would have to use irqsafe version of rstat flush but that is fine as this code path will be flushing for whole tree and do the work for everyone. No one will be waiting for that worker. [[email protected]: fix sleep-in-wrong context bug] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shakeel Butt <[email protected]> Tested-by: Marek Szyprowski <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Huang Ying <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Koutný <[email protected]> Cc: Muchun Song <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03memcg: switch lruvec stats to rstatShakeel Butt1-22/+20
The commit 2d146aa3aa84 ("mm: memcontrol: switch to rstat") switched memcg stats to rstat infrastructure but skipped the conversion of the lruvec stats as such stats are read in the performance critical code paths and flushing stats may have impacted the performances of the applications. This patch converts the lruvec stats to rstat and later patches add mechanisms to keep the performance impact to minimum. The rstat conversion comes with the price i.e. memory cost. Effectively this patch reverts the savings done by the commit f3344adf38bd ("mm: memcontrol: optimize per-lruvec stats counter memory usage"). However this cost is justified due to negative impact of the inaccurate lruvec stats on many heuristics. One such case is reported in [1]. The memory reclaim code is filled with plethora of heuristics and many of those heuristics reads the lruvec stats. So, inaccurate stats can make such heuristics ineffective. [1] reports the impact of inaccurate lruvec stats on the "cache trim mode" heuristic. Inaccurate lruvec stats can impact the deactivation and aging anon heuristics as well. [1] https://lore.kernel.org/linux-mm/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shakeel Butt <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Muchun Song <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Huang Ying <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michal Koutný <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03mm, memcg: inline swap-related functions to improve disabled memcg configSuren Baghdasaryan1-3/+23
Inline mem_cgroup_try_charge_swap, mem_cgroup_uncharge_swap and cgroup_throttle_swaprate functions to perform mem_cgroup_disabled static key check inline before calling the main body of the function. This minimizes the memcg overhead in the pagefault and exit_mmap paths when memcgs are disabled using cgroup_disable=memory command-line option. This change results in ~1% overhead reduction when running PFT test [1] comparing {CONFIG_MEMCG=n} against {CONFIG_MEMCG=y, cgroup_disable=memory} configuration on an 8-core ARM64 Android device. [1] https://lkml.org/lkml/2006/8/29/294 also used in mmtests suite Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Reviewed-by: Muchun Song <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Yang Shi <[email protected]> Cc: Alex Shi <[email protected]> Cc: Wei Yang <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Miaohe Lin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03mm, memcg: inline mem_cgroup_{charge/uncharge} to improve disabled memcg configSuren Baghdasaryan1-3/+25
Inline mem_cgroup_{charge/uncharge} and mem_cgroup_uncharge_list functions functions to perform mem_cgroup_disabled static key check inline before calling the main body of the function. This minimizes the memcg overhead in the pagefault and exit_mmap paths when memcgs are disabled using cgroup_disable=memory command-line option. This change results in ~0.4% overhead reduction when running PFT test [1] comparing {CONFIG_MEMCG=n} against {CONFIG_MEMCG=y, cgroup_disable=memory} configuration on an 8-core ARM64 Android device. [1] https://lkml.org/lkml/2006/8/29/294 also used in mmtests suite Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Reviewed-by: Muchun Song <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Alex Shi <[email protected]> Cc: Alistair Popple <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Wei Yang <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03huge tmpfs: shmem_is_huge(vma, inode, index)Hugh Dickins1-3/+6
Extend shmem_huge_enabled(vma) to shmem_is_huge(vma, inode, index), so that a consistent set of checks can be applied, even when the inode is accessed through read/write syscalls (with NULL vma) instead of mmaps (the index argument is seldom of interest, but required by mount option "huge=within_size"). Clean up and rearrange the checks a little. This then replaces the checks which shmem_fault() and shmem_getpage_gfp() were making, and eliminates the SGP_HUGE and SGP_NOHUGE modes. Replace a couple of 0s by explicit SHMEM_HUGE_NEVERs; and replace the obscure !shmem_mapping() symlink check by explicit S_ISLNK() - nothing else needs that symlink check, so leave it there in shmem_getpage_gfp(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Reviewed-by: Yang Shi <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Shakeel Butt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03huge tmpfs: SGP_NOALLOC to stop collapse_file() on raceHugh Dickins1-0/+1
khugepaged's collapse_file() currently uses SGP_NOHUGE to tell shmem_getpage() not to try allocating a huge page, in the very unlikely event that a racing hole-punch removes the swapped or fallocated page as soon as i_pages lock is dropped. We want to consolidate shmem's huge decisions, removing SGP_HUGE and SGP_NOHUGE; but cannot quite persuade ourselves that it's okay to regress the protection in this case - Yang Shi points out that the huge page would remain indefinitely, charged to root instead of the intended memcg. collapse_file() should not even allocate a small page in this case: why proceed if someone is punching a hole? SGP_READ is almost the right flag here, except that it optimizes away from a fallocated page, with NULL to tell caller to fill with zeroes (like a hole); whereas collapse_file()'s sequence relies on using a cache page. Add SGP_NOALLOC just for this. There are too many consecutive "if (page"s there in shmem_getpage_gfp(): group it better; and fix the outdated "bring it back from swap" comment. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Reviewed-by: Yang Shi <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Shakeel Butt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03huge tmpfs: fix split_huge_page() after FALLOC_FL_KEEP_SIZEHugh Dickins1-0/+13
A successful shmem_fallocate() guarantees that the extent has been reserved, even beyond i_size when the FALLOC_FL_KEEP_SIZE flag was used. But that guarantee is broken by shmem_unused_huge_shrink()'s attempts to split huge pages and free their excess beyond i_size; and by other uses of split_huge_page() near i_size. It's sad to add a shmem inode field just for this, but I did not find a better way to keep the guarantee. A flag to say KEEP_SIZE has been used would be cheaper, but I'm averse to unclearable flags. The fallocend field is not perfect either (many disjoint ranges might be fallocated), but good enough; and gains another use later on. Link: https://lkml.kernel.org/r/[email protected] Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure") Signed-off-by: Hugh Dickins <[email protected]> Reviewed-by: Yang Shi <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Shakeel Butt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03shmem: use raw_spinlock_t for ->stat_lockSebastian Andrzej Siewior1-1/+1
Each CPU has SHMEM_INO_BATCH inodes available in `->ino_batch' which is per-CPU. Access here is serialized by disabling preemption. If the pool is empty, it gets reloaded from `->next_ino'. Access here is serialized by ->stat_lock which is a spinlock_t and can not be acquired with disabled preemption. One way around it would make per-CPU ino_batch struct containing the inode number a local_lock_t. Another solution is to promote ->stat_lock to a raw_spinlock_t. The critical sections are short. The mpol_put() must be moved outside of the critical section to avoid invoking the destructor with disabled preemption. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Acked-by: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03mm: delete unused get_kernel_page()John Hubbard1-1/+0
get_kernel_page() was added in 2012 by [1]. It was used for a while for NFS, but then in 2014, a refactoring [2] removed all callers, and it has apparently not been used since. Remove get_kernel_page() because it has no callers. [1] commit 18022c5d8627 ("mm: add get_kernel_page[s] for pinning of kernel addresses for I/O") [2] commit 91f79c43d1b5 ("new helper: iov_iter_get_pages_alloc()") Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: John Hubbard <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: David S. Miller <[email protected]> Cc: Eric B Munson <[email protected]> Cc: Eric Paris <[email protected]> Cc: James Morris <[email protected]> Cc: Mike Christie <[email protected]> Cc: Neil Brown <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Xiaotian Feng <[email protected]> Cc: Mark Salter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03mm/gup: remove try_get_page(), call try_get_compound_head() directlyJohn Hubbard1-9/+1
try_get_page() is very similar to try_get_compound_head(), and in fact try_get_page() has fallen a little behind in terms of maintenance: try_get_compound_head() handles speculative page references more thoroughly. There are only two try_get_page() callsites, so just call try_get_compound_head() directly from those, and remove try_get_page() entirely. Also, seeing as how this changes try_get_compound_head() into a non-static function, provide some kerneldoc documentation for it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: John Hubbard <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Vasily Gorbik <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03mm/gup: small refactoring: simplify try_grab_page()John Hubbard1-2/+2
try_grab_page() does the same thing as try_grab_compound_head(..., refs=1, ...), just with a different API. So there is a lot of code duplication there. Change try_grab_page() to call try_grab_compound_head(), while keeping the API contract identical for callers. Also, now that try_grab_compound_head() always has a caller, remove the __maybe_unused annotation. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: John Hubbard <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Vasily Gorbik <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03include/linux/buffer_head.h: fix boolreturn.cocci warningsJing Yangyang1-1/+1
./include/linux/buffer_head.h:412:64-65:WARNING:return of 0/1 in function 'has_bh_in_lru' with return type bool Return statements in functions returning bool should use true/false instead of 1/0. Generated by: scripts/coccinelle/misc/boolreturn.cocci Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jing Yangyang <[email protected]> Reported-by: Zeal Robot <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03writeback: memcg: simplify cgroup_writeback_by_idShakeel Butt2-1/+16
Currently cgroup_writeback_by_id calls mem_cgroup_wb_stats() to get dirty pages for a memcg. However mem_cgroup_wb_stats() does a lot more than just get the number of dirty pages. Just directly get the number of dirty pages instead of calling mem_cgroup_wb_stats(). Also cgroup_writeback_by_id() is only called for best-effort dirty flushing, so remove the unused 'nr' parameter and no need to explicitly flush memcg stats. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shakeel Butt <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03writeback: fix bandwidth estimate for spiky workloadJan Kara2-0/+2
Michael Stapelberg has reported that for workload with short big spikes of writes (GCC linker seem to trigger this frequently) the write throughput is heavily underestimated and tends to steadily sink until it reaches zero. This has rather bad impact on writeback throttling (causing stalls). The problem is that writeback throughput estimate gets updated at most once per 200 ms. One update happens early after we submit pages for writeback (at that point writeout of only small fraction of pages is completed and thus observed throughput is tiny). Next update happens only during the next write spike (updates happen only from inode writeback and dirty throttling code) and if that is more than 1s after previous spike, we decide system was idle and just ignore whatever was written until this moment. Fix the problem by making sure writeback throughput estimate is also updated shortly after writeback completes to get reasonable estimate of throughput for spiky workloads. [[email protected]: avoid division by 0 in wb_update_dirty_ratelimit()] Link: https://lore.kernel.org/lkml/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Reported-by: Michael Stapelberg <[email protected]> Tested-by: Michael Stapelberg <[email protected]> Cc: Wu Fengguang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03writeback: reliably update bandwidth estimationJan Kara2-1/+19
Currently we trigger writeback bandwidth estimation from balance_dirty_pages() and from wb_writeback(). However neither of these need to trigger when the system is relatively idle and writeback is triggered e.g. from fsync(2). Make sure writeback estimates happen reliably by triggering them from do_writepages(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Cc: Michael Stapelberg <[email protected]> Cc: Wu Fengguang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03writeback: track number of inodes under writebackJan Kara1-0/+1
Patch series "writeback: Fix bandwidth estimates", v4. Fix estimate of writeback throughput when device is not fully busy doing writeback. Michael Stapelberg has reported that such workload (e.g. generated by linking) tends to push estimated throughput down to 0 and as a result writeback on the device is practically stalled. The first three patches fix the reported issue, the remaining two patches are unrelated cleanups of problems I've noticed when reading the code. This patch (of 4): Track number of inodes under writeback for each bdi_writeback structure. We will use this to decide whether wb does any IO and so we can estimate its writeback throughput. In principle we could use number of pages under writeback (WB_WRITEBACK counter) for this however normal percpu counter reads are too inaccurate for our purposes and summing the counter is too expensive. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Cc: Wu Fengguang <[email protected]> Cc: Michael Stapelberg <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03mm: report a more useful address for reclaim acquisitionMatthew Wilcox (Oracle)1-4/+4
A recent lockdep report included these lines: [ 96.177910] 3 locks held by containerd/770: [ 96.177934] #0: ffff88810815ea28 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x115/0x770 [ 96.177999] #1: ffffffff82915020 (rcu_read_lock){....}-{1:2}, at: get_swap_device+0x33/0x140 [ 96.178057] #2: ffffffff82955ba0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 While it was not useful to that bug report to know where the reclaim lock had been acquired, it might be useful under other circumstances. Allow the caller of __fs_reclaim_acquire to specify the instruction pointer to use. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Omar Sandoval <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Boqun Feng <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03fs: update documentation of get_write_access() and friendsDavid Hildenbrand1-7/+12
As VM_DENYWRITE does no longer exists, let's spring-clean the documentation of get_write_access() and friends. Acked-by: "Eric W. Biederman" <[email protected]> Acked-by: Christian König <[email protected]> Signed-off-by: David Hildenbrand <[email protected]>
2021-09-03mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff()David Hildenbrand1-1/+2
Let's also remove masking off MAP_DENYWRITE from ksys_mmap_pgoff(): the last in-tree occurrence of MAP_DENYWRITE is now in LEGACY_MAP_MASK, which accepts the flag e.g., for MAP_SHARED_VALIDATE; however, the flag is ignored throughout the kernel now. Add a comment to LEGACY_MAP_MASK stating that MAP_DENYWRITE is ignored. Acked-by: "Eric W. Biederman" <[email protected]> Acked-by: Christian König <[email protected]> Signed-off-by: David Hildenbrand <[email protected]>
2021-09-03mm: remove VM_DENYWRITEDavid Hildenbrand2-2/+0
All in-tree users of MAP_DENYWRITE are gone. MAP_DENYWRITE cannot be set from user space, so all users are gone; let's remove it. Acked-by: "Eric W. Biederman" <[email protected]> Acked-by: Christian König <[email protected]> Signed-off-by: David Hildenbrand <[email protected]>
2021-09-03kernel/fork: always deny write access to current MM exe_fileDavid Hildenbrand1-1/+1
We want to remove VM_DENYWRITE only currently only used when mapping the executable during exec. During exec, we already deny_write_access() the executable, however, after exec completes the VMAs mapped with VM_DENYWRITE effectively keeps write access denied via deny_write_access(). Let's deny write access when setting or replacing the MM exe_file. With this change, we can remove VM_DENYWRITE for mapping executables. Make set_mm_exe_file() return an error in case deny_write_access() fails; note that this should never happen, because exec code does a deny_write_access() early and keeps write access denied when calling set_mm_exe_file. However, it makes the code easier to read and makes set_mm_exe_file() and replace_mm_exe_file() look more similar. This represents a minor user space visible change: sys_prctl(PR_SET_MM_MAP/EXE_FILE) can now fail if the file is already opened writable. Also, after sys_prctl(PR_SET_MM_MAP/EXE_FILE) the file cannot be opened writable. Note that we can already fail with -EACCES if the file doesn't have execute permissions. Acked-by: "Eric W. Biederman" <[email protected]> Acked-by: Christian König <[email protected]> Signed-off-by: David Hildenbrand <[email protected]>
2021-09-03kernel/fork: factor out replacing the current MM exe_fileDavid Hildenbrand1-0/+1
Let's factor the main logic out into replace_mm_exe_file(), such that all mm->exe_file logic is contained in kernel/fork.c. While at it, perform some simple cleanups that are possible now that we're simplifying the individual functions. Acked-by: Christian Brauner <[email protected]> Acked-by: "Eric W. Biederman" <[email protected]> Acked-by: Christian König <[email protected]> Signed-off-by: David Hildenbrand <[email protected]>
2021-09-03libata: Add ATA_HORKAGE_NO_NCQ_ON_ATI for Samsung 860 and 870 SSD.Kate Hsuan1-0/+1
Many users are reporting that the Samsung 860 and 870 SSD are having various issues when combined with AMD/ATI (vendor ID 0x1002) SATA controllers and only completely disabling NCQ helps to avoid these issues. Always disabling NCQ for Samsung 860/870 SSDs regardless of the host SATA adapter vendor will cause I/O performance degradation with well behaved adapters. To limit the performance impact to ATI adapters, introduce the ATA_HORKAGE_NO_NCQ_ON_ATI flag to force disable NCQ only for these adapters. Also, two libata.force parameters (noncqati and ncqati) are introduced to disable and enable the NCQ for the system which equipped with ATI SATA adapter and Samsung 860 and 870 SSDs. The user can determine NCQ function to be enabled or disabled according to the demand. After verifying the chipset from the user reports, the issue appears on AMD/ATI SB7x0/SB8x0/SB9x0 SATA Controllers and does not appear on recent AMD SATA adapters. The vendor ID of ATI should be 0x1002. Therefore, ATA_HORKAGE_NO_NCQ_ON_AMD was modified to ATA_HORKAGE_NO_NCQ_ON_ATI. BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=201693 Signed-off-by: Kate Hsuan <[email protected]> Reviewed-by: Hans de Goede <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Martin K. Petersen <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2021-09-02Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds4-59/+19
Pull SCSI updates from James Bottomley: "This series consists of the usual driver updates (ufs, qla2xxx, target, smartpqi, lpfc, mpt3sas). The core change causing the most churn was replacing the command request field request with a macro, allowing us to offset map to it and remove the redundant field; the same was also done for the tag field. The most impactful change is the final removal of scsi_ioctl, which has been deprecated for over a decade" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (293 commits) scsi: ufs: Fix ufshcd_request_sense_async() for Samsung KLUFG8RHDA-B2D1 scsi: ufs: ufs-exynos: Fix static checker warning scsi: mpt3sas: Use the proper SCSI midlayer interfaces for PI scsi: lpfc: Use the proper SCSI midlayer interfaces for PI scsi: lpfc: Copyright updates for 14.0.0.1 patches scsi: lpfc: Update lpfc version to 14.0.0.1 scsi: lpfc: Add bsg support for retrieving adapter cmf data scsi: lpfc: Add cmf_info sysfs entry scsi: lpfc: Add debugfs support for cm framework buffers scsi: lpfc: Add support for maintaining the cm statistics buffer scsi: lpfc: Add rx monitoring statistics scsi: lpfc: Add support for the CM framework scsi: lpfc: Add cmfsync WQE support scsi: lpfc: Add support for cm enablement buffer scsi: lpfc: Add cm statistics buffer support scsi: lpfc: Add EDC ELS support scsi: lpfc: Expand FPIN and RDF receive logging scsi: lpfc: Add MIB feature enablement support scsi: lpfc: Add SET_HOST_DATA mbox cmd to pass date/time info to firmware scsi: fc: Add EDC ELS definition ...
2021-09-02Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds1-9/+47
Pull rdma updates from Jason Gunthorpe: "This is quite a small cycle, no major series stands out. The HNS and rxe drivers saw the most activity this cycle, with rxe being broken for a good chunk of time. The significant deleted line count is due to a SPDX cleanup series. Summary: - Various cleanup and small features for rtrs - kmap_local_page() conversions - Driver updates and fixes for: efa, rxe, mlx5, hfi1, qed, hns - Cache the IB subnet prefix - Rework how CRC is calcuated in rxe - Clean reference counting in iwpm's netlink - Pull object allocation and lifecycle for user QPs to the uverbs core code - Several small hns features and continued general code cleanups - Fix the scatterlist confusion of orig_nents/nents introduced in an earlier patch creating the append operation" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (90 commits) RDMA/mlx5: Relax DCS QP creation checks RDMA/hns: Delete unnecessary blank lines. RDMA/hns: Encapsulate the qp db as a function RDMA/hns: Adjust the order in which irq are requested and enabled RDMA/hns: Remove RST2RST error prints for hw v1 RDMA/hns: Remove dqpn filling when modify qp from Init to Init RDMA/hns: Fix QP's resp incomplete assignment RDMA/hns: Fix query destination qpn RDMA/hfi1: Convert to SPDX identifier IB/rdmavt: Convert to SPDX identifier RDMA/hns: Bugfix for incorrect association between dip_idx and dgid RDMA/hns: Bugfix for the missing assignment for dip_idx RDMA/hns: Bugfix for data type of dip_idx RDMA/hns: Fix incorrect lsn field RDMA/irdma: Remove the repeated declaration RDMA/core/sa_query: Retry SA queries RDMA: Use the sg_table directly and remove the opencoded version from umem lib/scatterlist: Fix wrong update of orig_nents lib/scatterlist: Provide a dedicated function to support table append RDMA/hns: Delete unused hns bitmap interface ...
2021-09-02Merge tag 'clk-for-linus' of ↵Linus Torvalds5-5/+21
git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux Pull clk updates from Stephen Boyd: "Nothing changed in the clk framework core this time around. We did get some updates to the basic clk types to use determine_rate for the divider type and add a power of two fractional divider flag though. Otherwise, this is a collection of clk driver updates. More than half the diffstat is in the Qualcomm clk driver where we add a bunch of data to describe clks on various SoCs and fix bugs. The other big new thing in here is the Mediatek MT8192 clk driver. That's been under review for a while and it's nice to see that it's finally upstream. Beyond that it's the usual set of minor fixes and tweaks to clk drivers. There are some non-clk driver bits in here which have all been acked by the respective maintainers. New Drivers: - Support video, gpu, display clks on qcom sc7280 SoCs - GCC clks on qcom MSM8953, SM4250/6115, and SM6350 SoCs - Multimedia clks (MMCC) on qcom MSM8994/MSM8992 - RPMh clks on qcom SM6350 SoCs - Support for Mediatek MT8192 SoCs - Add display (DU and DSI) clocks on Renesas R-Car V3U - Add I2C, DMAC, USB, sound (SSIF-2), GPIO, CANFD, and ADC clocks and resets on Renesas RZ/G2L Updates: - Support the SD/OE pin on IDT VersaClock 5 and 6 clock generators - Add power of two flag to fractional divider clk type - Migrate some clk drivers to clk_divider_ops.determine_rate - Migrate to clk_parent_data in gcc-sdm660 - Fix CLKOUT clocks on i.MX8MM and i.MX8MN by using imx_clk_hw_mux2 - Switch from .round_rate to .determine_rate in clk-divider-gate - Fix clock tree update for TF-A controlled clocks for all i.MX8M - Add missing M7 core clock for i.MX8MN - YAML conversion of rk3399 clock controller binding - Removal of GRF dependency for the rk3328/rk3036 pll types - Drop CLK_IS_CRITICAL flag from Tegra fuse clk - Make CLK_R9A06G032 Kconfig symbol invisible - Convert various DT bindings to YAML" * tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: (128 commits) dt-bindings: clock: samsung: fix header path in example clk: tegra: fix old-style declaration clk: qcom: Add SM6350 GCC driver MAINTAINERS: clock: include S3C and S5P in Samsung SoC clock entry dt-bindings: clock: samsung: convert S5Pv210 AudSS to dtschema dt-bindings: clock: samsung: convert Exynos AudSS to dtschema dt-bindings: clock: samsung: convert Exynos4 to dtschema dt-bindings: clock: samsung: convert Exynos3250 to dtschema dt-bindings: clock: samsung: convert Exynos542x to dtschema dt-bindings: clock: samsung: add bindings for Exynos external clock dt-bindings: clock: samsung: convert Exynos5250 to dtschema clk: vc5: Add properties for configuring SD/OE behavior clk: vc5: Use dev_err_probe dt-bindings: clk: vc5: Add properties for configuring the SD/OE pin dt-bindings: clock: brcm,iproc-clocks: fix armpll properties clk: zynqmp: Fix kernel-doc format clk: at91: clk-generated: Limit the requested rate to our range clk: ralink: avoid to set 'CLK_IS_CRITICAL' flag for gates clk: zynqmp: Fix a memory leak clk: zynqmp: Check the return type ...
2021-09-02Merge tag 'platform-drivers-x86-v5.15-1' of ↵Linus Torvalds2-0/+12
git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86 Pull x86 platform driver updates from Hans de Goede: "Highlights: - Move all the Intel drivers into their own subdir(s) (mostly Kate's work) - New meraki-mx100 platform driver - Asus WMI driver enhancements, including support for /sys/firmware/acpi/platform_profile - New BIOS SAR driver for Intel M.2 WWAM modems - Alder Lake support for the Intel PMC driver - A whole bunch of cleanups + fixes all over the place" * tag 'platform-drivers-x86-v5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: (65 commits) platform/x86: dell-smbios-wmi: Add missing kfree in error-exit from run_smbios_call platform/x86: dell-smbios-wmi: Avoid false-positive memcpy() warning platform/x86: ISST: use semi-colons instead of commas platform/x86: asus-wmi: Fix "unsigned 'retval' is never less than zero" smatch warning platform/x86: asus-wmi: Delete impossible condition platform/x86: hp_accel: Convert to be a platform driver platform/x86: hp_accel: Remove _INI method call platform/mellanox: mlxbf-pmc: fix kernel-doc notation platform/x86/intel: pmc/core: Add GBE Package C10 fix for Alder Lake PCH platform/x86/intel: pmc/core: Add Alder Lake low power mode support for pmc core platform/x86/intel: pmc/core: Add Latency Tolerance Reporting (LTR) support to Alder Lake platform/x86/intel: pmc/core: Add Alderlake support to pmc core driver platform/x86: intel-wmi-thunderbolt: Move to intel sub-directory platform/x86: intel-wmi-sbl-fw-update: Move to intel sub-directory platform/x86: intel-vbtn: Move to intel sub-directory platform/x86: intel_oaktrail: Move to intel sub-directory platform/x86: intel_int0002_vgpio: Move to intel sub-directory platform/x86: intel-hid: Move to intel sub-directory platform/x86: intel_atomisp2: Move to intel sub-directory platform/x86: intel_speed_select_if: Move to intel sub-directory ...
2021-09-02ceph: flush mdlog before umountingXiubo Li1-0/+1
Signed-off-by: Xiubo Li <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2021-09-02Merge tag 'vfio-v5.15-rc1' of git://github.com/awilliam/linux-vfioLinus Torvalds5-11/+298
Pull VFIO updates from Alex Williamson: - Fix dma-valid return WAITED implementation (Anthony Yznaga) - SPDX license cleanups (Cai Huoqing) - Split vfio-pci-core from vfio-pci and enhance PCI driver matching to support future vendor provided vfio-pci variants (Yishai Hadas, Max Gurtovoy, Jason Gunthorpe) - Replace duplicated reflck with core support for managing first open, last close, and device sets (Jason Gunthorpe, Max Gurtovoy, Yishai Hadas) - Fix non-modular mdev support and don't nag about request callback support (Christoph Hellwig) - Add semaphore to protect instruction intercept handler and replace open-coded locks in vfio-ap driver (Tony Krowiak) - Convert vfio-ap to vfio_register_group_dev() API (Jason Gunthorpe) * tag 'vfio-v5.15-rc1' of git://github.com/awilliam/linux-vfio: (37 commits) vfio/pci: Introduce vfio_pci_core.ko vfio: Use kconfig if XX/endif blocks instead of repeating 'depends on' vfio: Use select for eventfd PCI / VFIO: Add 'override_only' support for VFIO PCI sub system PCI: Add 'override_only' field to struct pci_device_id vfio/pci: Move module parameters to vfio_pci.c vfio/pci: Move igd initialization to vfio_pci.c vfio/pci: Split the pci_driver code out of vfio_pci_core.c vfio/pci: Include vfio header in vfio_pci_core.h vfio/pci: Rename ops functions to fit core namings vfio/pci: Rename vfio_pci_device to vfio_pci_core_device vfio/pci: Rename vfio_pci_private.h to vfio_pci_core.h vfio/pci: Rename vfio_pci.c to vfio_pci_core.c vfio/ap_ops: Convert to use vfio_register_group_dev() s390/vfio-ap: replace open coded locks for VFIO_GROUP_NOTIFY_SET_KVM notification s390/vfio-ap: r/w lock for PQAP interception handler function pointer vfio/type1: Fix vfio_find_dma_valid return vfio-pci/zdev: Remove repeated verbose license text vfio: platform: reset: Convert to SPDX identifier vfio: Remove struct vfio_device_ops open/release ...
2021-09-02locking/rwsem: Add missing __init_rwsem() for PREEMPT_RTMike Galbraith1-10/+2
730633f0b7f95 became the first direct caller of __init_rwsem() vs the usual init_rwsem(), exposing PREEMPT_RT's lack thereof. Add it. [ tglx: Move it out of line ] Signed-off-by: Mike Galbraith <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-09-02Merge branch 'remotes/lorenzo/pci/endpoint'Bjorn Helgaas2-27/+46
- Add max-virtual-functions to endpoint binding (Kishon Vijay Abraham I) - Add pci_epf_add_vepf() API to add virtual function to endpoint (Kishon Vijay Abraham I) - Add pci_epf_vepf_link() to link virtual function to endpoint physical function (Kishon Vijay Abraham I) - Add virtual function number to pci_epc_ops endpoint ops interfaces (Kishon Vijay Abraham I) - Simplify register base address computation for endpoint BAR configuration (Kishon Vijay Abraham I) - Add support to configure virtual functions in cadence endpoint driver (Kishon Vijay Abraham I) - Add SR-IOV configuration to endpoint test driver (Kishon Vijay Abraham I) - Document configfs usage to create virtual functions for endpoints (Kishon Vijay Abraham I) * remotes/lorenzo/pci/endpoint: Documentation: PCI: endpoint/pci-endpoint-cfs: Guide to use SR-IOV misc: pci_endpoint_test: Populate sriov_configure ops to configure SR-IOV device PCI: cadence: Add support to configure virtual functions PCI: cadence: Simplify code to get register base address for configuring BAR PCI: endpoint: Add virtual function number in pci_epc ops PCI: endpoint: Add support to link a physical function to a virtual function PCI: endpoint: Add support to add virtual function in endpoint core dt-bindings: PCI: pci-ep: Add binding to specify virtual function
2021-09-02Merge branch 'remotes/lorenzo/pci/hyper-v'Bjorn Helgaas1-0/+11
- Add domain_nr in struct pci_host_bridge (Boqun Feng) - Use host bridge MSI domain for root buses if present (Boqun Feng) - Allow ARM64 virtual host bridge with no ACPI companion (e.g., Hyper-V) (Boqun Feng) - Make Hyper-V enumeration more generic (Arnd Bergmann) - Set Hyper-V domain_nr at probe-time (Boqun Feng) - Set up Hyper-V MSI domain at bridge probe-time (Boqun Feng) - Enable Hyper-V bridge probing on ARM64 (Boqun Feng) * remotes/lorenzo/pci/hyper-v: PCI: hv: Turn on the host bridge probing on ARM64 PCI: hv: Set up MSI domain at bridge probing time PCI: hv: Set ->domain_nr of pci_host_bridge at probing time PCI: hv: Generify PCI probing arm64: PCI: Support root bridge preparation for Hyper-V arm64: PCI: Restructure pcibios_root_bridge_prepare() PCI: Support populating MSI domains of root buses via bridges PCI: Introduce domain_nr in pci_host_bridge
2021-09-02Merge branch 'pci/misc'Bjorn Helgaas1-20/+3
- Add pci_numachip_init() declaration (Krzysztof Wilczyński) - Allocate pci_dev_str_match_path() string atomically (Dan Carpenter) - Drop error message when Precision Time Measurement supported but not enabled (Jakub Kicinski) - Correct the pci_iomap.h header guard #endif comment (Jonathan Cameron) - Add schedule point in proc_bus_pci_read() (Krzysztof Wilczyński) - Make saved capability state private to core (Bjorn Helgaas) - Sync __pci_register_driver() stub for CONFIG_PCI=n (Andy Shevchenko) - Convert sta2x11 from PCI-DMA-API to generic DMA-API (Christophe JAILLET) * pci/misc: x86/PCI: sta2x11: switch from 'pci_' to 'dma_' API PCI: Sync __pci_register_driver() stub for CONFIG_PCI=n PCI: Make saved capability state private to core PCI: Add schedule point in proc_bus_pci_read() PCI: Correct the pci_iomap.h header guard #endif comment PCI/PTM: Remove error message at boot PCI: Fix pci_dev_str_match_path() alloc while atomic bug x86/PCI: Add pci_numachip_init() declaration # Conflicts: # include/linux/pci.h
2021-09-02Merge branch 'pci/vpd'Bjorn Helgaas1-79/+32
- Check Resource Item Names against those defined for type (Bjorn Helgaas) - Treat initial 0xff as missing EEPROM (Heiner Kallweit) - Reject resource tags with invalid size (Bjorn Helgaas) - Don't check Large Resource Item Names for validity (Bjorn Helgaas) - Allow access to valid parts of VPD if some is invalid (Bjorn Helgaas) - Remove pci_vpd_size() old_size argument (Heiner Kallweit) - Make pci_vpd_wait() uninterruptible (Heiner Kallweit) - Remove struct pci_vpd.flag (Heiner Kallweit) - Remove struct pci_vpd_ops (Heiner Kallweit) - Remove struct pci_vpd.valid member (Heiner Kallweit) - Embed struct pci_vpd in struct pci_dev (Heiner Kallweit) - Determine VPD size in pci_vpd_init() (Heiner Kallweit) - Treat invalid VPD like missing VPD capability (Heiner Kallweit) - Add pci_vpd_alloc() to allocate buffer and read VPD into it (Heiner Kallweit) - Add pci_vpd_find_ro_info_keyword() (Heiner Kallweit) - Add pci_vpd_check_csum() (Heiner Kallweit) - Add pci_vpd_find_id_string() (Heiner Kallweit) - Read VPD with pci_vpd_alloc() (bnx2x, bnxt, sfc, sfc falcon, tg3 drivers) (Heiner Kallweit) - Search VPD with pci_vpd_find_ro_info_keyword() (bnx2, bnx2x, bnxt, cxgb4, cxlflash SCSI, sfc, sfc falcon, tg3 drivers) (Heiner Kallweit) - Search VPD with pci_vpd_find_id_string() (cxgb4 driver) (Heiner Kallweit) - Validate VPD checksum with pci_vpd_check_csum() (cxgb4, tg3 drivers) (Heiner Kallweit) - Replace open-coded byte swapping with swab32s() in bnx2 (Heiner Kallweit) - Remove unused vpd_param member ec (Heiner Kallweit) - Stop exporting pci_vpd_find_tag(), pci_vpd_find_info_keyword() (Heiner Kallweit) - Move several VPD defines and inlines to internal PCI core (Heiner Kallweit) * pci/vpd: PCI/VPD: Use unaligned access helpers PCI/VPD: Clean up public VPD defines and inline functions cxgb4: Use pci_vpd_find_id_string() to find VPD ID string PCI/VPD: Add pci_vpd_find_id_string() PCI/VPD: Include post-processing in pci_vpd_find_tag() PCI/VPD: Stop exporting pci_vpd_find_info_keyword() PCI/VPD: Stop exporting pci_vpd_find_tag() scsi: cxlflash: Search VPD with pci_vpd_find_ro_info_keyword() cxgb4: Search VPD with pci_vpd_find_ro_info_keyword() cxgb4: Remove unused vpd_param member ec cxgb4: Validate VPD checksum with pci_vpd_check_csum() bnxt: Search VPD with pci_vpd_find_ro_info_keyword() bnxt: Read VPD with pci_vpd_alloc() bnx2x: Search VPD with pci_vpd_find_ro_info_keyword() bnx2x: Read VPD with pci_vpd_alloc() bnx2: Replace open-coded byte swapping with swab32s() bnx2: Search VPD with pci_vpd_find_ro_info_keyword() sfc: falcon: Search VPD with pci_vpd_find_ro_info_keyword() sfc: falcon: Read VPD with pci_vpd_alloc() tg3: Search VPD with pci_vpd_find_ro_info_keyword() tg3: Validate VPD checksum with pci_vpd_check_csum() tg3: Read VPD with pci_vpd_alloc() sfc: Search VPD with pci_vpd_find_ro_info_keyword() sfc: Read VPD with pci_vpd_alloc() PCI/VPD: Add pci_vpd_check_csum() PCI/VPD: Add pci_vpd_find_ro_info_keyword() PCI/VPD: Add pci_vpd_alloc() PCI/VPD: Treat invalid VPD like missing VPD capability PCI/VPD: Determine VPD size in pci_vpd_init() PCI/VPD: Embed struct pci_vpd in struct pci_dev PCI/VPD: Remove struct pci_vpd.valid member PCI/VPD: Remove struct pci_vpd_ops PCI/VPD: Reorder pci_read_vpd(), pci_write_vpd() PCI/VPD: Remove struct pci_vpd.flag PCI/VPD: Make pci_vpd_wait() uninterruptible PCI/VPD: Remove pci_vpd_size() old_size argument PCI/VPD: Allow access to valid parts of VPD if some is invalid PCI/VPD: Don't check Large Resource Item Names for validity PCI/VPD: Reject resource tags with invalid size PCI/VPD: Treat initial 0xff as missing EEPROM PCI/VPD: Check Resource Item Names against those valid for type PCI/VPD: Correct diagnostic for VPD read failure
2021-09-02Merge branch 'pci/virtualization'Bjorn Helgaas1-1/+2
- Add ACS quirks for NXP LX2xx0 and LX2xx2 platforms (Wasim Khan) - Add ACS quirks for Cavium multi-function devices (George Cherian) - Enforce pci=noats with Transaction Blocking (Alex Williamson) * pci/virtualization: PCI/ACS: Enforce pci=noats with Transaction Blocking PCI: Add ACS quirks for Cavium multi-function devices PCI: Add ACS quirks for NXP LX2xx0 and LX2xx2 platforms