aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2021-11-09kernel.h: drop unneeded <linux/kernel.h> inclusion from other headersAndy Shevchenko4-3/+1
Patch series "kernel.h further split", v5. kernel.h is a set of something which is not related to each other and often used in non-crossed compilation units, especially when drivers need only one or two macro definitions from it. This patch (of 7): There is no evidence we need kernel.h inclusion in certain headers. Drop unneeded <linux/kernel.h> inclusion from other headers. [[email protected]: bottom_half.h needs kernel] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Andy Shevchenko <[email protected]> Signed-off-by: Stephen Rothwell <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Brendan Higgins <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Will Deacon <[email protected]> Cc: Waiman Long <[email protected]> Cc: Boqun Feng <[email protected]> Cc: Sakari Ailus <[email protected]> Cc: Laurent Pinchart <[email protected]> Cc: Mauro Carvalho Chehab <[email protected]> Cc: Miguel Ojeda <[email protected]> Cc: Jonathan Cameron <[email protected]> Cc: Rasmus Villemoes <[email protected]> Cc: Thorsten Leemhuis <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-09proc: allow pid_revalidate() during LOOKUP_RCUStephen Brennan1-8/+10
Problem Description: When running running ~128 parallel instances of TZ=/etc/localtime ps -fe >/dev/null on a 128CPU machine, the %sys utilization reaches 97%, and perf shows the following code path as being responsible for heavy contention on the d_lockref spinlock: walk_component() lookup_fast() d_revalidate() pid_revalidate() // returns -ECHILD unlazy_child() lockref_get_not_dead(&nd->path.dentry->d_lockref) <-- contention The reason is that pid_revalidate() is triggering a drop from RCU to ref path walk mode. All concurrent path lookups thus try to grab a reference to the dentry for /proc/, before re-executing pid_revalidate() and then stepping into the /proc/$pid directory. Thus there is huge spinlock contention. This patch allows pid_revalidate() to execute in RCU mode, meaning that the path lookup can successfully enter the /proc/$pid directory while still in RCU mode. Later on, the path lookup may still drop into ref mode, but the contention will be much reduced at this point. By applying this patch, %sys utilization falls to around 85% under the same workload, and the number of ps processes executed per unit time increases by 3x-4x. Although this particular workload is a bit contrived, we have seen some large collections of eager monitoring scripts which produced similarly high %sys time due to contention in the /proc directory. As a result this patch, Al noted that several procfs methods which were only called in ref-walk mode could now be called from RCU mode. To ensure that this patch is safe, I audited all the inode get_link and permission() implementations, as well as dentry d_revalidate() implementations, in fs/proc. The purpose here is to ensure that they either are safe to call in RCU (i.e. don't sleep) or correctly bail out of RCU mode if they don't support it. My analysis shows that all at-risk procfs methods are safe to call under RCU, and thus this patch is safe. Procfs RCU-walk Analysis: This analysis is up-to-date with 5.15-rc3. When called under RCU mode, these functions have arguments as follows: * get_link() receives a NULL dentry pointer when called in RCU mode. * permission() receives MAY_NOT_BLOCK in the mode parameter when called from RCU. * d_revalidate() receives LOOKUP_RCU in flags. For the following functions, either they are trivially RCU safe, or they explicitly bail at the beginning of the function when they run: proc_ns_get_link (bails out) proc_get_link (RCU safe) proc_pid_get_link (bails out) map_files_d_revalidate (bails out) map_misc_d_revalidate (bails out) proc_net_d_revalidate (RCU safe) proc_sys_revalidate (bails out, also not under /proc/$pid) tid_fd_revalidate (bails out) proc_sys_permission (not under /proc/$pid) The remainder of the functions require a bit more detail: * proc_fd_permission: RCU safe. All of the body of this function is under rcu_read_lock(), except generic_permission() which declares itself RCU safe in its documentation string. * proc_self_get_link uses GFP_ATOMIC in the RCU case, so it is RCU aware and otherwise looks safe. The same is true of proc_thread_self_get_link. * proc_map_files_get_link: calls ns_capable, which calls capable(), and thus calls into the audit code (see note #1 below). The remainder is just a call to the trivially safe proc_pid_get_link(). * proc_pid_permission: calls ptrace_may_access(), which appears RCU safe, although it does call into the "security_ptrace_access_check()" hook, which looks safe under smack and selinux. Just the audit code is of concern. Also uses get_task_struct() and put_task_struct(), see note #2 below. * proc_tid_comm_permission: Appears safe, though calls put_task_struct (see note #2 below). Note #1: Most of the concern of RCU safety has centered around the audit code. However, since b17ec22fb339 ("selinux: slow_avc_audit has become non-blocking"), it's safe to call this code under RCU. So all of the above are safe by my estimation. Note #2: get_task_struct() and put_task_struct(): The majority of get_task_struct() is under RCU read lock, and in any case it is a simple increment. But put_task_struct() is complex, given that it could at some point free the task struct, and this process has many steps which I couldn't manually verify. However, several other places call put_task_struct() under RCU, so it appears safe to use here too (see kernel/hung_task.c:165 or rcu/tree-stall.h:296) Patch description: pid_revalidate() drops from RCU into REF lookup mode. When many threads are resolving paths within /proc in parallel, this can result in heavy spinlock contention on d_lockref as each thread tries to grab a reference to the /proc dentry (and drop it shortly thereafter). Investigation indicates that it is not necessary to drop RCU in pid_revalidate(), as no RCU data is modified and the function never sleeps. So, remove the LOOKUP_RCU check. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stephen Brennan <[email protected]> Cc: Konrad Wilk <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Alexey Dobriyan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-09virtio-mem: kdump mode to sanitize /proc/vmcore accessDavid Hildenbrand1-12/+124
Although virtio-mem currently supports reading unplugged memory in the hypervisor, this will change in the future, indicated to the device via a new feature flag. We similarly sanitized /proc/kcore access recently. [1] Let's register a vmcore callback, to allow vmcore code to check if a PFN belonging to a virtio-mem device is either currently plugged and should be dumped or is currently unplugged and should not be accessed, instead mapping the shared zeropage or returning zeroes when reading. This is important when not capturing /proc/vmcore via tools like "makedumpfile" that can identify logically unplugged virtio-mem memory via PG_offline in the memmap, but simply by e.g., copying the file. Distributions that support virtio-mem+kdump have to make sure that the virtio_mem module will be part of the kdump kernel or the kdump initrd; dracut was recently [2] extended to include virtio-mem in the generated initrd. As long as no special kdump kernels are used, this will automatically make sure that virtio-mem will be around in the kdump initrd and sanitize /proc/vmcore access -- with dracut. With this series, we'll send one virtio-mem state request for every ~2 MiB chunk of virtio-mem memory indicated in the vmcore that we intend to read/map. In the future, we might want to allow building virtio-mem for kdump mode only, even without CONFIG_MEMORY_HOTPLUG and friends: this way, we could support special stripped-down kdump kernels that have many other config options disabled; we'll tackle that once required. Further, we might want to try sensing bigger blocks (e.g., memory sections) first before falling back to device blocks on demand. Tested with Fedora rawhide, which contains a recent kexec-tools version (considering "System RAM (virtio_mem)" when creating the vmcore header) and a recent dracut version (including the virtio_mem module in the kdump initrd). Link: https://lkml.kernel.org/r/[email protected] [1] Link: https://github.com/dracutdevs/dracut/pull/1157 [2] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Baoquan He <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Dave Young <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jason Wang <[email protected]> Cc: Juergen Gross <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Stefano Stabellini <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vivek Goyal <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-09virtio-mem: factor out hotplug specifics from virtio_mem_remove() into ↵David Hildenbrand1-3/+10
virtio_mem_deinit_hotplug() Let's prepare for a new virtio-mem kdump mode in which we don't actually hot(un)plug any memory but only observe the state of device blocks. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Baoquan He <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Dave Young <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jason Wang <[email protected]> Cc: Juergen Gross <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Stefano Stabellini <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vivek Goyal <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-09virtio-mem: factor out hotplug specifics from virtio_mem_probe() into ↵David Hildenbrand1-42/+45
virtio_mem_init_hotplug() Let's prepare for a new virtio-mem kdump mode in which we don't actually hot(un)plug any memory but only observe the state of device blocks. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Baoquan He <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Dave Young <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jason Wang <[email protected]> Cc: Juergen Gross <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Stefano Stabellini <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vivek Goyal <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-09virtio-mem: factor out hotplug specifics from virtio_mem_init() into ↵David Hildenbrand1-37/+44
virtio_mem_init_hotplug() Let's prepare for a new virtio-mem kdump mode in which we don't actually hot(un)plug any memory but only observe the state of device blocks. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Baoquan He <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Dave Young <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jason Wang <[email protected]> Cc: Juergen Gross <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Stefano Stabellini <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vivek Goyal <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-09proc/vmcore: convert oldmem_pfn_is_ram callback to more generic vmcore callbacksDavid Hildenbrand4-38/+111
Let's support multiple registered callbacks, making sure that registering vmcore callbacks cannot fail. Make the callback return a bool instead of an int, handling how to deal with errors internally. Drop unused HAVE_OLDMEM_PFN_IS_RAM. We soon want to make use of this infrastructure from other drivers: virtio-mem, registering one callback for each virtio-mem device, to prevent reading unplugged virtio-mem memory. Handle it via a generic vmcore_cb structure, prepared for future extensions: for example, once we support virtio-mem on s390x where the vmcore is completely constructed in the second kernel, we want to detect and add plugged virtio-mem memory ranges to the vmcore in order for them to get dumped properly. Handle corner cases that are unexpected and shouldn't happen in sane setups: registering a callback after the vmcore has already been opened (warn only) and unregistering a callback after the vmcore has already been opened (warn and essentially read only zeroes from that point on). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Baoquan He <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Dave Young <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jason Wang <[email protected]> Cc: Juergen Gross <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Stefano Stabellini <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vivek Goyal <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-09proc/vmcore: let pfn_is_ram() return a boolDavid Hildenbrand1-4/+4
The callback should deal with errors internally, it doesn't make sense to expose these via pfn_is_ram(). We'll rework the callbacks next. Right now we consider errors as if "it's RAM"; no functional change. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Baoquan He <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Dave Young <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jason Wang <[email protected]> Cc: Juergen Gross <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Stefano Stabellini <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vivek Goyal <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-09x86/xen: print a warning when HVMOP_get_mem_type failsDavid Hildenbrand1-1/+3
HVMOP_get_mem_type is not expected to fail, "This call failing is indication of something going quite wrong and it would be good to know about this." [1] Let's add a pr_warn_once(). Link: https://lkml.kernel.org/r/[email protected] [1] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Suggested-by: Boris Ostrovsky <[email protected]> Cc: Baoquan He <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Young <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jason Wang <[email protected]> Cc: Juergen Gross <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Stefano Stabellini <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vivek Goyal <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-09x86/xen: simplify xen_oldmem_pfn_is_ram()David Hildenbrand1-14/+1
Let's simplify return handling. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Baoquan He <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Dave Young <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jason Wang <[email protected]> Cc: Juergen Gross <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Stefano Stabellini <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vivek Goyal <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-09x86/xen: update xen_oldmem_pfn_is_ram() documentationDavid Hildenbrand1-6/+3
After removing /dev/kmem, sanitizing /proc/kcore and handling /dev/mem, this series tackles the last sane way how a VM could accidentially access logically unplugged memory managed by a virtio-mem device: /proc/vmcore When dumping memory via "makedumpfile", PG_offline pages, used by virtio-mem to flag logically unplugged memory, are already properly excluded; however, especially when accessing/copying /proc/vmcore "the usual way", we can still end up reading logically unplugged memory part of a virtio-mem device. Patch #1-#3 are cleanups. Patch #4 extends the existing oldmem_pfn_is_ram mechanism. Patch #5-#7 are virtio-mem refactorings for patch #8, which implements the virtio-mem logic to query the state of device blocks. Patch #8: "Although virtio-mem currently supports reading unplugged memory in the hypervisor, this will change in the future, indicated to the device via a new feature flag. We similarly sanitized /proc/kcore access recently. [...] Distributions that support virtio-mem+kdump have to make sure that the virtio_mem module will be part of the kdump kernel or the kdump initrd; dracut was recently [2] extended to include virtio-mem in the generated initrd. As long as no special kdump kernels are used, this will automatically make sure that virtio-mem will be around in the kdump initrd and sanitize /proc/vmcore access -- with dracut" This is the last remaining bit to support VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE [3] in the Linux implementation of virtio-mem. Note: this is best-effort. We'll never be able to control what runs inside the second kernel, really, but we also don't have to care: we only care about sane setups where we don't want our VM getting zapped once we touch the wrong memory location while dumping. While we usually expect sane setups to use "makedumfile", nothing really speaks against just copying /proc/vmcore, especially in environments where HWpoisioning isn't typically expected. Also, we really don't want to put all our trust completely on the memmap, so sanitizing also makes sense when just using "makedumpfile". [1] https://lkml.kernel.org/r/[email protected] [2] https://github.com/dracutdevs/dracut/pull/1157 [3] https://lists.oasis-open.org/archives/virtio-comment/202109/msg00021.html This patch (of 9): The callback is only used for the vmcore nowadays. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Boris Ostrovsky <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Stefano Stabellini <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Jason Wang <[email protected]> Cc: Dave Young <[email protected]> Cc: Baoquan He <[email protected]> Cc: Vivek Goyal <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-09procfs: do not list TID 0 in /proc/<pid>/taskFlorian Weimer4-0/+87
If a task exits concurrently, task_pid_nr_ns may return 0. [[email protected]: coding style tweaks] [[email protected]: test that /proc/*/task doesn't contain "0"] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Florian Weimer <[email protected]> Signed-off-by: Alexey Dobriyan <[email protected]> Acked-by: Christian Brauner <[email protected]> Reviewed-by: Alexey Dobriyan <[email protected]> Cc: Kees Cook <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-09mm,hugetlb: remove mlock ulimit for SHM_HUGETLBzhangyiru5-31/+13
Commit 21a3c273f88c ("mm, hugetlb: add thread name and pid to SHM_HUGETLB mlock rlimit warning") marked this as deprecated in 2012, but it is not deleted yet. Mike says he still sees that message in log files on occasion, so maybe we should preserve this warning. Also remove hugetlbfs related user_shm_unlock in ipc/shm.c and remove the user_shm_unlock after out. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: zhangyiru <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Liu Zixian <[email protected]> Cc: Michal Hocko <[email protected]> Cc: wuxu.wu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-09vfs: keep inodes with page cache off the inode shrinker LRUJohannes Weiner8-22/+120
Historically (pre-2.5), the inode shrinker used to reclaim only empty inodes and skip over those that still contained page cache. This caused problems on highmem hosts: struct inode could put fill lowmem zones before the cache was getting reclaimed in the highmem zones. To address this, the inode shrinker started to strip page cache to facilitate reclaiming lowmem. However, this comes with its own set of problems: the shrinkers may drop actively used page cache just because the inodes are not currently open or dirty - think working with a large git tree. It further doesn't respect cgroup memory protection settings and can cause priority inversions between containers. Nowadays, the page cache also holds non-resident info for evicted cache pages in order to detect refaults. We've come to rely heavily on this data inside reclaim for protecting the cache workingset and driving swap behavior. We also use it to quantify and report workload health through psi. The latter in turn is used for fleet health monitoring, as well as driving automated memory sizing of workloads and containers, proactive reclaim and memory offloading schemes. The consequences of dropping page cache prematurely is that we're seeing subtle and not-so-subtle failures in all of the above-mentioned scenarios, with the workload generally entering unexpected thrashing states while losing the ability to reliably detect it. To fix this on non-highmem systems at least, going back to rotating inodes on the LRU isn't feasible. We've tried (commit a76cf1a474d7 ("mm: don't reclaim inodes with many attached pages")) and failed (commit 69056ee6a8a3 ("Revert "mm: don't reclaim inodes with many attached pages"")). The issue is mostly that shrinker pools attract pressure based on their size, and when objects get skipped the shrinkers remember this as deferred reclaim work. This accumulates excessive pressure on the remaining inodes, and we can quickly eat into heavily used ones, or dirty ones that require IO to reclaim, when there potentially is plenty of cold, clean cache around still. Instead, this patch keeps populated inodes off the inode LRU in the first place - just like an open file or dirty state would. An otherwise clean and unused inode then gets queued when the last cache entry disappears. This solves the problem without reintroducing the reclaim issues, and generally is a bit more scalable than having to wade through potentially hundreds of thousands of busy inodes. Locking is a bit tricky because the locks protecting the inode state (i_lock) and the inode LRU (lru_list.lock) don't nest inside the irq-safe page cache lock (i_pages.xa_lock). Page cache deletions are serialized through i_lock, taken before the i_pages lock, to make sure depopulated inodes are queued reliably. Additions may race with deletions, but we'll check again in the shrinker. If additions race with the shrinker itself, we're protected by the i_lock: if find_inode() or iput() win, the shrinker will bail on the elevated i_count or I_REFERENCED; if the shrinker wins and goes ahead with the inode, it will set I_FREEING and inhibit further igets(), which will cause the other side to create a new instance of the inode instead. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Dave Chinner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-09f2fs: fix UAF in f2fs_available_free_memoryDongliang Mu1-0/+2
if2fs_fill_super -> f2fs_build_segment_manager -> create_discard_cmd_control -> f2fs_start_discard_thread It invokes kthread_run to create a thread and run issue_discard_thread. However, if f2fs_build_node_manager fails, the control flow goes to free_nm and calls f2fs_destroy_node_manager. This function will free sbi->nm_info. However, if issue_discard_thread accesses sbi->nm_info after the deallocation, but before the f2fs_stop_discard_thread, it will cause UAF(Use-after-free). -> f2fs_destroy_segment_manager -> destroy_discard_cmd_control -> f2fs_stop_discard_thread Fix this by stopping discard thread before f2fs_destroy_node_manager. Note that, the commit d6d2b491a82e1 introduces the call of f2fs_available_free_memory into issue_discard_thread. Cc: [email protected] Fixes: d6d2b491a82e ("f2fs: allow to change discard policy based on cached discard cmds") Signed-off-by: Dongliang Mu <[email protected]> Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
2021-11-09f2fs: invalidate META_MAPPING before IPU/DIO writeHyeong-Jun Kim2-0/+5
Encrypted pages during GC are read and cached in META_MAPPING. However, due to cached pages in META_MAPPING, there is an issue where newly written pages are lost by IPU or DIO writes. Thread A - f2fs_gc() Thread B /* phase 3 */ down_write(i_gc_rwsem) ra_data_block() ---- (a) up_write(i_gc_rwsem) f2fs_direct_IO() : - down_read(i_gc_rwsem) - __blockdev_direct_io() - get_data_block_dio_write() - f2fs_dio_submit_bio() ---- (b) - up_read(i_gc_rwsem) /* phase 4 */ down_write(i_gc_rwsem) move_data_block() ---- (c) up_write(i_gc_rwsem) (a) In phase 3 of f2fs_gc(), up-to-date page is read from storage and cached in META_MAPPING. (b) In thread B, writing new data by IPU or DIO write on same blkaddr as read in (a). cached page in META_MAPPING become out-dated. (c) In phase 4 of f2fs_gc(), out-dated page in META_MAPPING is copied to new blkaddr. In conclusion, the newly written data in (b) is lost. To address this issue, invalidating pages in META_MAPPING before IPU or DIO write. Fixes: 6aa58d8ad20a ("f2fs: readahead encrypted block during GC") Signed-off-by: Hyeong-Jun Kim <[email protected]> Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
2021-11-09nvme: wait until quiesce is doneMing Lei1-0/+4
NVMe uses one atomic flag to check if quiesce is needed. If quiesce is started, the helper returns immediately. This way is wrong, since we have to wait until quiesce is done. Fixes: e70feb8b3e68 ("blk-mq: support concurrent queue quiesce/unquiesce") Reviewed-by: Keith Busch <[email protected]> Signed-off-by: Ming Lei <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-11-09scsi: make sure that request queue queiesce and unquiesce balancedMing Lei2-9/+29
For fixing queue quiesce race between driver and block layer(elevator switch, update nr_requests, ...), we need to support concurrent quiesce and unquiesce, which requires the two call balanced. It isn't easy to audit that in all scsi drivers, especially the two may be called from different contexts, so do it in scsi core with one per-device atomic variable to balance quiesce and unquiesce. Reported-by: Yi Zhang <[email protected]> Fixes: e70feb8b3e68 ("blk-mq: support concurrent queue quiesce/unquiesce") Signed-off-by: Ming Lei <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-11-09scsi: avoid to quiesce sdev->request_queue two timesMing Lei1-15/+14
For fixing queue quiesce race between driver and block layer(elevator switch, update nr_requests, ...), we need to support concurrent quiesce and unquiesce, which requires the two to be balanced. blk_mq_quiesce_queue() calls blk_mq_quiesce_queue_nowait() for updating quiesce depth and marking the flag, then scsi_internal_device_block() calls blk_mq_quiesce_queue_nowait() two times actually. Fix the double quiesce and keep quiesce and unquiesce balanced. Reported-by: Yi Zhang <[email protected]> Fixes: e70feb8b3e68 ("blk-mq: support concurrent queue quiesce/unquiesce") Signed-off-by: Ming Lei <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-11-09blk-mq: add one API for waiting until quiesce is doneMing Lei2-8/+21
Some drivers(NVMe, SCSI) need to call quiesce and unquiesce in pair, but it is hard to switch to this style, so these drivers need one atomic flag for helping to balance quiesce and unquiesce. When quiesce is in-progress, the driver still needs to wait until the quiesce is done, so add API of blk_mq_wait_quiesce_done() for these drivers. Signed-off-by: Ming Lei <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-11-09MIPS: fix duplicated slashes for Platform file pathMasahiro Yamada1-1/+1
platform-y accumulates platform names with a slash appended. The current $(patsubst ...) ends up with doubling slashes. GNU Make still include Platform files, but in case of an error, a clumsy file path is displayed: arch/mips/loongson2ef//Platform:36: *** only binutils >= 2.20.2 have needed option -mfix-loongson2f-nop. Stop. Signed-off-by: Masahiro Yamada <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
2021-11-09MIPS: fix *-pkg builds for loongson2ef platformMasahiro Yamada1-0/+2
Since commit 805b2e1d427a ("kbuild: include Makefile.compiler only when compiler is needed"), package builds for the loongson2f platform fail. $ make ARCH=mips CROSS_COMPILE=mips64-linux- lemote2f_defconfig bindeb-pkg [ snip ] sh ./scripts/package/builddeb arch/mips/loongson2ef//Platform:36: *** only binutils >= 2.20.2 have needed option -mfix-loongson2f-nop. Stop. cp: cannot stat '': No such file or directory make[5]: *** [scripts/Makefile.package:87: intdeb-pkg] Error 1 make[4]: *** [Makefile:1558: intdeb-pkg] Error 2 make[3]: *** [debian/rules:13: binary-arch] Error 2 dpkg-buildpackage: error: debian/rules binary subprocess returned exit status 2 make[2]: *** [scripts/Makefile.package:83: bindeb-pkg] Error 2 make[1]: *** [Makefile:1558: bindeb-pkg] Error 2 make: *** [Makefile:350: __build_one_by_one] Error 2 The reason is because "make image_name" fails. $ make ARCH=mips CROSS_COMPILE=mips64-linux- image_name arch/mips/loongson2ef//Platform:36: *** only binutils >= 2.20.2 have needed option -mfix-loongson2f-nop. Stop. In general, adding $(error ...) in the parse stage is troublesome, and it is pointless to check toolchains even if we are not building anything. Do not include Kbuild.platform in such cases. Fixes: 805b2e1d427a ("kbuild: include Makefile.compiler only when compiler is needed") Reported-by: Jason Self <[email protected]> Signed-off-by: Masahiro Yamada <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
2021-11-09PCI: brcmstb: Allow building for BMIPS_GENERICFlorian Fainelli1-1/+2
BMIPS_GENERIC denotes support for the MIPS-based Broadcom STB platforms which this driver can support. Signed-off-by: Florian Fainelli <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
2021-11-09MIPS: BMIPS: Enable PCI KconfigFlorian Fainelli1-0/+2
Enable HAVE_PCI and PCI_DRIVERS_GENERIC so we can build PCIE_BRCMSTB which is the PCIe host bridge driver for this platform. Signed-off-by: Florian Fainelli <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
2021-11-09MIPS: VDSO: remove -nostdlib compiler flagMasahiro Yamada1-1/+1
The -nostdlib option requests the compiler to not use the standard system startup files or libraries when linking. It is effective only when $(CC) is used as a linker driver. Since commit 2ff906994b6c ("MIPS: VDSO: Use $(LD) instead of $(CC) to link VDSO"), $(LD) is directly used, hence -nostdlib is unneeded. Signed-off-by: Masahiro Yamada <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
2021-11-09mips: BCM63XX: ensure that CPU_SUPPORTS_32BIT_KERNEL is setRandy Dunlap1-0/+3
Several header files need info on CONFIG_32BIT or CONFIG_64BIT, but kconfig symbol BCM63XX does not provide that info. This leads to many build errors, e.g.: arch/mips/include/asm/page.h:196:13: error: use of undeclared identifier 'CAC_BASE' return x - PAGE_OFFSET + PHYS_OFFSET; arch/mips/include/asm/mach-generic/spaces.h:91:23: note: expanded from macro 'PAGE_OFFSET' #define PAGE_OFFSET (CAC_BASE + PHYS_OFFSET) arch/mips/include/asm/io.h:134:28: error: use of undeclared identifier 'CAC_BASE' return (void *)(address + PAGE_OFFSET - PHYS_OFFSET); arch/mips/include/asm/mach-generic/spaces.h:91:23: note: expanded from macro 'PAGE_OFFSET' #define PAGE_OFFSET (CAC_BASE + PHYS_OFFSET) arch/mips/include/asm/uaccess.h:82:10: error: use of undeclared identifier '__UA_LIMIT' return (__UA_LIMIT & (addr | (addr + size) | __ua_size(size))) == 0; Selecting the SYS_HAS_CPU_BMIPS* symbols causes SYS_HAS_CPU_BMIPS to be set, which then selects CPU_SUPPORT_32BIT_KERNEL, which causes CONFIG_32BIT to be set. (a bit more indirect than v1 [RFC].) Fixes: e7300d04bd08 ("MIPS: BCM63xx: Add support for the Broadcom BCM63xx family of SOCs.") Signed-off-by: Randy Dunlap <[email protected]> Reported-by: kernel test robot <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Florian Fainelli <[email protected]> Cc: [email protected] Cc: [email protected] Cc: Paul Burton <[email protected]> Cc: Maxime Bizon <[email protected]> Cc: Ralf Baechle <[email protected]> Suggested-by: Florian Fainelli <[email protected]> Acked-by: Florian Fainelli <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
2021-11-09MIPS: Update bmips_stb_defconfigFlorian Fainelli1-8/+147
Align the bmips_stb_defconfig with its downstream version at: https://github.com/Broadcom/stblinux-4.1/blob/master/linux/arch/mips/configs/bmips_stb_defconfig to be slightly more useful and include support for all of these options: - latest Broadcom STB drivers - support for high resolution timers - cpufreq - function tracers - extending command line from DTB - task lockup detector - strong stack protector support - IP auto-configuration Signed-off-by: Florian Fainelli <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
2021-11-09MIPS: Allow modules to set board_be_handlerFlorian Fainelli12-14/+20
After making the brcmstb_gisb driver modular with 707a4cdf86e5 ("bus: brcmstb_gisb: Allow building as module") Guenter reported that mips allmodconfig failed to link because board_be_handler was referenced. Thomas indicated that if we were to continue making the brcmstb_gisb driver modular for MIPS we would need to introduce a function that allows setting the board_be_handler and export that function towards modules. This is what is being done here: board_be_handler is made static and is now settable with a mips_set_be_handler() function which is exported. Reported-by: Guenter Roeck <[email protected]> Suggested-by: Thomas Bogendoerfer <[email protected]> Fixes: 707a4cdf86e5 ("bus: brcmstb_gisb: Allow building as module") Signed-off-by: Florian Fainelli <[email protected]> Tested-by: Guenter Roeck <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
2021-11-09drm/i915/adlp/fb: Prevent the mapping of redundant trailing padding NULL pagesImre Deak2-1/+11
So far the remapped view size in GTT/DPT was padded to the next aligned offset unnecessarily after the last color plane with an unaligned size. Remove the unnecessary padding. Cc: Juha-Pekka Heikkila <[email protected]> Fixes: 3d1adc3d64cf ("drm/i915/adlp: Add support for remapping CCS FBs") Signed-off-by: Imre Deak <[email protected]> Reviewed-by: Juha-Pekka Heikkila <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit 6b6636e17649d75b4d0cc55d3dff9e44511a442a) Signed-off-by: Rodrigo Vivi <[email protected]>
2021-11-09drm/i915/fb: Fix rounding error in subsampled plane size calculationImre Deak1-2/+2
For NV12 FBs with odd main surface tile-row height the CCS surface height was incorrectly calculated 1 less than the actual value. Fix this by rounding up the result of divison. For consistency do the same for the CCS surface width calculation. Fixes: b3e57bccd68a ("drm/i915/tgl: Gen-12 render decompression") Signed-off-by: Imre Deak <[email protected]> Reviewed-by: Juha-Pekka Heikkila <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit 2ee5ef9c934ad26376c9282171e731e6c0339815) Signed-off-by: Rodrigo Vivi <[email protected]>
2021-11-09drm/i915/hdmi: Turn DP++ TMDS output buffers back on in encoder->shutdown()Ville Syrjälä4-2/+17
Looks like our VBIOS/GOP generally fail to turn the DP dual mode adater TMDS output buffers back on after a reboot. This leads to a black screen after reboot if we turned the TMDS output buffers off prior to reboot. And if i915 decides to do a fastboot the black screen will persist even after i915 takes over. Apparently this has been a problem ever since commit b2ccb822d376 ("drm/i915: Enable/disable TMDS output buffers in DP++ adaptor as needed") if one rebooted while the display was turned off. And things became worse with commit fe0f1e3bfdfe ("drm/i915: Shut down displays gracefully on reboot") since now we always turn the display off before a reboot. This was reported on a RKL, but I confirmed the same behaviour on my SNB as well. So looks pretty universal. Let's fix this by explicitly turning the TMDS output buffers back on in the encoder->shutdown() hook. Note that this gets called after irqs have been disabled, so the i2c communication with the DP dual mode adapter has to be performed via polling (which the gmbus code is perfectly happy to do for us). We also need a bit of care in handling DDI encoders which may or may not be set up for HDMI output. Specifically ddc_pin will not be populated for a DP only DDI encoder, in which case we don't want to call intel_gmbus_get_adapter(). We can handle that by simply doing the dual mode adapter type check before calling intel_gmbus_get_adapter(). Cc: <[email protected]> # v5.11+ Fixes: fe0f1e3bfdfe ("drm/i915: Shut down displays gracefully on reboot") Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/4371 Signed-off-by: Ville Syrjälä <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Reviewed-by: Stanislav Lisovskiy <[email protected]> (cherry picked from commit 49c55f7b035b87371a6d3c53d9af9f92ddc962db) Signed-off-by: Rodrigo Vivi <[email protected]>
2021-11-09amt: add IPV6 Kconfig dependencyArnd Bergmann1-0/+1
This driver cannot be built-in if IPV6 is a loadable module: x86_64-linux-ld: drivers/net/amt.o: in function `amt_build_mld_gq': amt.c:(.text+0x2e7d): undefined reference to `ipv6_dev_get_saddr' Add the idiomatic Kconfig dependency that all such modules have. Fixes: b9022b53adad ("amt: add control plane of amt interface") Signed-off-by: Arnd Bergmann <[email protected]> Acked-by: Taehee Yoo <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-11-09gve: Fix off by one in gve_tx_timeout()Dan Carpenter1-1/+1
The priv->ntfy_blocks[] has "priv->num_ntfy_blks" elements so this > needs to be >= to prevent an off by one bug. The priv->ntfy_blocks[] array is allocated in gve_alloc_notify_blocks(). Fixes: 87a7f321bb6a ("gve: Recover from queue stall due to missed IRQ") Signed-off-by: Dan Carpenter <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-11-09btrfs: fix deadlock due to page faults during direct IO reads and writesFilipe Manana1-16/+123
If we do a direct IO read or write when the buffer given by the user is memory mapped to the file range we are going to do IO, we end up ending in a deadlock. This is triggered by the new test case generic/647 from fstests. For a direct IO read we get a trace like this: [967.872718] INFO: task mmap-rw-fault:12176 blocked for more than 120 seconds. [967.874161] Not tainted 5.14.0-rc7-btrfs-next-95 #1 [967.874909] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [967.875983] task:mmap-rw-fault state:D stack: 0 pid:12176 ppid: 11884 flags:0x00000000 [967.875992] Call Trace: [967.875999] __schedule+0x3ca/0xe10 [967.876015] schedule+0x43/0xe0 [967.876020] wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs] [967.876109] ? do_wait_intr_irq+0xb0/0xb0 [967.876118] lock_extent_bits+0x37/0x90 [btrfs] [967.876150] btrfs_lock_and_flush_ordered_range+0xa9/0x120 [btrfs] [967.876184] ? extent_readahead+0xa7/0x530 [btrfs] [967.876214] extent_readahead+0x32d/0x530 [btrfs] [967.876253] ? lru_cache_add+0x104/0x220 [967.876255] ? kvm_sched_clock_read+0x14/0x40 [967.876258] ? sched_clock_cpu+0xd/0x110 [967.876263] ? lock_release+0x155/0x4a0 [967.876271] read_pages+0x86/0x270 [967.876274] ? lru_cache_add+0x125/0x220 [967.876281] page_cache_ra_unbounded+0x1a3/0x220 [967.876291] filemap_fault+0x626/0xa20 [967.876303] __do_fault+0x36/0xf0 [967.876308] __handle_mm_fault+0x83f/0x15f0 [967.876322] handle_mm_fault+0x9e/0x260 [967.876327] __get_user_pages+0x204/0x620 [967.876332] ? get_user_pages_unlocked+0x69/0x340 [967.876340] get_user_pages_unlocked+0xd3/0x340 [967.876349] internal_get_user_pages_fast+0xbca/0xdc0 [967.876366] iov_iter_get_pages+0x8d/0x3a0 [967.876374] bio_iov_iter_get_pages+0x82/0x4a0 [967.876379] ? lock_release+0x155/0x4a0 [967.876387] iomap_dio_bio_actor+0x232/0x410 [967.876396] iomap_apply+0x12a/0x4a0 [967.876398] ? iomap_dio_rw+0x30/0x30 [967.876414] __iomap_dio_rw+0x29f/0x5e0 [967.876415] ? iomap_dio_rw+0x30/0x30 [967.876420] ? lock_acquired+0xf3/0x420 [967.876429] iomap_dio_rw+0xa/0x30 [967.876431] btrfs_file_read_iter+0x10b/0x140 [btrfs] [967.876460] new_sync_read+0x118/0x1a0 [967.876472] vfs_read+0x128/0x1b0 [967.876477] __x64_sys_pread64+0x90/0xc0 [967.876483] do_syscall_64+0x3b/0xc0 [967.876487] entry_SYSCALL_64_after_hwframe+0x44/0xae [967.876490] RIP: 0033:0x7fb6f2c038d6 [967.876493] RSP: 002b:00007fffddf586b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000011 [967.876496] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007fb6f2c038d6 [967.876498] RDX: 0000000000001000 RSI: 00007fb6f2c17000 RDI: 0000000000000003 [967.876499] RBP: 0000000000001000 R08: 0000000000000003 R09: 0000000000000000 [967.876501] R10: 0000000000001000 R11: 0000000000000246 R12: 0000000000000003 [967.876502] R13: 0000000000000000 R14: 00007fb6f2c17000 R15: 0000000000000000 This happens because at btrfs_dio_iomap_begin() we lock the extent range and return with it locked - we only unlock in the endio callback, at end_bio_extent_readpage() -> endio_readpage_release_extent(). Then after iomap called the btrfs_dio_iomap_begin() callback, it triggers the page faults that resulting in reading the pages, through the readahead callback btrfs_readahead(), and through there we end to attempt to lock again the same extent range (or a subrange of what we locked before), resulting in the deadlock. For a direct IO write, the scenario is a bit different, and it results in trace like this: [1132.442520] run fstests generic/647 at 2021-08-31 18:53:35 [1330.349355] INFO: task mmap-rw-fault:184017 blocked for more than 120 seconds. [1330.350540] Not tainted 5.14.0-rc7-btrfs-next-95 #1 [1330.351158] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [1330.351900] task:mmap-rw-fault state:D stack: 0 pid:184017 ppid:183725 flags:0x00000000 [1330.351906] Call Trace: [1330.351913] __schedule+0x3ca/0xe10 [1330.351930] schedule+0x43/0xe0 [1330.351935] btrfs_start_ordered_extent+0x108/0x1c0 [btrfs] [1330.352020] ? do_wait_intr_irq+0xb0/0xb0 [1330.352028] btrfs_lock_and_flush_ordered_range+0x8c/0x120 [btrfs] [1330.352064] ? extent_readahead+0xa7/0x530 [btrfs] [1330.352094] extent_readahead+0x32d/0x530 [btrfs] [1330.352133] ? lru_cache_add+0x104/0x220 [1330.352135] ? kvm_sched_clock_read+0x14/0x40 [1330.352138] ? sched_clock_cpu+0xd/0x110 [1330.352143] ? lock_release+0x155/0x4a0 [1330.352151] read_pages+0x86/0x270 [1330.352155] ? lru_cache_add+0x125/0x220 [1330.352162] page_cache_ra_unbounded+0x1a3/0x220 [1330.352172] filemap_fault+0x626/0xa20 [1330.352176] ? filemap_map_pages+0x18b/0x660 [1330.352184] __do_fault+0x36/0xf0 [1330.352189] __handle_mm_fault+0x1253/0x15f0 [1330.352203] handle_mm_fault+0x9e/0x260 [1330.352208] __get_user_pages+0x204/0x620 [1330.352212] ? get_user_pages_unlocked+0x69/0x340 [1330.352220] get_user_pages_unlocked+0xd3/0x340 [1330.352229] internal_get_user_pages_fast+0xbca/0xdc0 [1330.352246] iov_iter_get_pages+0x8d/0x3a0 [1330.352254] bio_iov_iter_get_pages+0x82/0x4a0 [1330.352259] ? lock_release+0x155/0x4a0 [1330.352266] iomap_dio_bio_actor+0x232/0x410 [1330.352275] iomap_apply+0x12a/0x4a0 [1330.352278] ? iomap_dio_rw+0x30/0x30 [1330.352292] __iomap_dio_rw+0x29f/0x5e0 [1330.352294] ? iomap_dio_rw+0x30/0x30 [1330.352306] btrfs_file_write_iter+0x238/0x480 [btrfs] [1330.352339] new_sync_write+0x11f/0x1b0 [1330.352344] ? NF_HOOK_LIST.constprop.0.cold+0x31/0x3e [1330.352354] vfs_write+0x292/0x3c0 [1330.352359] __x64_sys_pwrite64+0x90/0xc0 [1330.352365] do_syscall_64+0x3b/0xc0 [1330.352369] entry_SYSCALL_64_after_hwframe+0x44/0xae [1330.352372] RIP: 0033:0x7f4b0a580986 [1330.352379] RSP: 002b:00007ffd34d75418 EFLAGS: 00000246 ORIG_RAX: 0000000000000012 [1330.352382] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f4b0a580986 [1330.352383] RDX: 0000000000001000 RSI: 00007f4b0a3a4000 RDI: 0000000000000003 [1330.352385] RBP: 00007f4b0a3a4000 R08: 0000000000000003 R09: 0000000000000000 [1330.352386] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003 [1330.352387] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 Unlike for reads, at btrfs_dio_iomap_begin() we return with the extent range unlocked, but later when the page faults are triggered and we try to read the extents, we end up btrfs_lock_and_flush_ordered_range() where we find the ordered extent for our write, created by the iomap callback btrfs_dio_iomap_begin(), and we wait for it to complete, which makes us deadlock since we can't complete the ordered extent without reading the pages (the iomap code only submits the bio after the pages are faulted in). Fix this by setting the nofault attribute of the given iov_iter and retry the direct IO read/write if we get an -EFAULT error returned from iomap. For reads, also disable page faults completely, this is because when we read from a hole or a prealloc extent, we can still trigger page faults due to the call to iov_iter_zero() done by iomap - at the moment, it is oblivious to the value of the ->nofault attribute of an iov_iter. We also need to keep track of the number of bytes written or read, and pass it to iomap_dio_rw(), as well as use the new flag IOMAP_DIO_PARTIAL. This depends on the iov_iter and iomap changes introduced in commit c03098d4b9ad ("Merge tag 'gfs2-v5.15-rc5-mmap-fault' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2"). Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-11-09hamradio: defer 6pack kfree after unregister_netdevLin Ma1-2/+4
There is a possible race condition (use-after-free) like below (USE) | (FREE) dev_queue_xmit | __dev_queue_xmit | __dev_xmit_skb | sch_direct_xmit | ... xmit_one | netdev_start_xmit | tty_ldisc_kill __netdev_start_xmit | 6pack_close sp_xmit | kfree sp_encaps | | According to the patch "defer ax25 kfree after unregister_netdev", this patch reorder the kfree after the unregister_netdev to avoid the possible UAF as the unregister_netdev() is well synchronized and won't return if there is a running routine. Signed-off-by: Lin Ma <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-11-09hamradio: defer ax25 kfree after unregister_netdevLin Ma1-4/+5
There is a possible race condition (use-after-free) like below (USE) | (FREE) ax25_sendmsg | ax25_queue_xmit | dev_queue_xmit | __dev_queue_xmit | __dev_xmit_skb | sch_direct_xmit | ... xmit_one | netdev_start_xmit | tty_ldisc_kill __netdev_start_xmit | mkiss_close ax_xmit | kfree ax_encaps | | Even though there are two synchronization primitives before the kfree: 1. wait_for_completion(&ax->dead). This can prevent the race with routines from mkiss_ioctl. However, it cannot stop the routine coming from upper layer, i.e., the ax25_sendmsg. 2. netif_stop_queue(ax->dev). It seems that this line of code aims to halt the transmit queue but it fails to stop the routine that already being xmit. This patch reorder the kfree after the unregister_netdev to avoid the possible UAF as the unregister_netdev() is well synchronized and won't return if there is a running routine. Signed-off-by: Lin Ma <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-11-09net: sungem_phy: fix code indentationJean Sacren1-1/+1
Remove extra space in front of the return statement. Fixes: eb5b5b2ff96e ("sungem_phy: support bcm5461 phy, autoneg.") Signed-off-by: Jean Sacren <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-11-09drm/locking: fix __stack_depot_* name conflictStephen Rothwell1-7/+7
Commit cd06ab2fd48f ("drm/locking: add backtrace for locking contended locks without backoff") added functions named __stack_depot_* in drm which conflict with stack depot. Rename to __drm_stack_depot_*. v2 by Jani: - Also rename __stack_depot_print References: https://lore.kernel.org/r/[email protected] Fixes: cd06ab2fd48f ("drm/locking: add backtrace for locking contended locks without backoff") Cc: Daniel Vetter <[email protected]> Reviewed-by: Daniel Vetter <[email protected]> Signed-off-by: Stephen Rothwell <[email protected]> Signed-off-by: Jani Nikula <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit c4f08d7246a520da5f2b1068f635da0678485e33)
2021-11-09ALSA: synth: missing check for possible NULL after the call to kstrdupAustin Kim1-1/+1
If kcalloc() return NULL due to memory starvation, it is possible for kstrdup() to return NULL in similar case. So add null check after the call to kstrdup() is made. [ minor coding-style fix by tiwai ] Signed-off-by: Austin Kim <[email protected]> Cc: <[email protected]> Link: https://lore.kernel.org/r/20211109003742.GA5423@raspberrypi Signed-off-by: Takashi Iwai <[email protected]>
2021-11-09ALSA: memalloc: Use proper SG helpers for noncontig allocationsTakashi Iwai1-3/+61
The recently introduced non-contiguous page allocation support helpers are using the simplified code to calculate the page and DMA address based on the vmalloc helpers, but this isn't quite right as the vmap is valid only for the direct DMA. This patch corrects those accessors to use the proper SG helpers instead. Fixes: a25684a95646 ("ALSA: memalloc: Support for non-contiguous page allocation") Tested-by: Alex Xu (Hello71) <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Takashi Iwai <[email protected]>
2021-11-09dmaengine: ti: k3-udma: Set r/tchan or rflow to NULL if request failKishon Vijay Abraham I1-4/+20
udma_get_*() checks if rchan/tchan/rflow is already allocated by checking if it has a NON NULL value. For the error cases, rchan/tchan/rflow will have error value and udma_get_*() considers this as already allocated (PASS) since the error values are NON NULL. This results in NULL pointer dereference error while de-referencing rchan/tchan/rflow. Reset the value of rchan/tchan/rflow to NULL if a channel request fails. CC: [email protected] Acked-by: Peter Ujfalusi <[email protected]> Signed-off-by: Kishon Vijay Abraham I <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Vinod Koul <[email protected]>
2021-11-09dmaengine: ti: k3-udma: Set bchan to NULL if a channel request failKishon Vijay Abraham I1-2/+6
bcdma_get_*() checks if bchan is already allocated by checking if it has a NON NULL value. For the error cases, bchan will have error value and bcdma_get_*() considers this as already allocated (PASS) since the error values are NON NULL. This results in NULL pointer dereference error while de-referencing bchan. Reset the value of bchan to NULL if a channel request fails. CC: [email protected] Acked-by: Peter Ujfalusi <[email protected]> Signed-off-by: Kishon Vijay Abraham I <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Vinod Koul <[email protected]>
2021-11-09dmaengine: stm32-dma: avoid 64-bit division in stm32_dma_get_max_widthArnd Bergmann1-3/+3
Using the % operator on a 64-bit variable is expensive and can cause a link failure: arm-linux-gnueabi-ld: drivers/dma/stm32-dma.o: in function `stm32_dma_get_max_width': stm32-dma.c:(.text+0x170): undefined reference to `__aeabi_uldivmod' arm-linux-gnueabi-ld: drivers/dma/stm32-dma.o: in function `stm32_dma_set_xfer_param': stm32-dma.c:(.text+0x1cd4): undefined reference to `__aeabi_uldivmod' As we know that we just want to check the alignment in stm32_dma_get_max_width(), there is no need for a full division, and using a simple mask is a faster replacement. Same in stm32_dma_set_xfer_param(), change this to only allow burst transfers if the address is a multiple of the length. stm32_dma_get_best_burst just after will take buf_len into account to fix burst in case of misalignment. Fixes: b20fd5fa310c ("dmaengine: stm32-dma: fix stm32_dma_get_max_width") Reported-by: kernel test robot <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]> Signed-off-by: Amelie Delaunay <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Vinod Koul <[email protected]>
2021-11-09crypto: api - Fix boot-up crash when crypto manager is disabledHerbert Xu1-0/+2
When the crypto manager is disabled, we need to explicitly set the crypto algorithms' tested status so that they can be used. Fixes: cad439fc040e ("crypto: api - Do not create test larvals if...") Reported-by: Geert Uytterhoeven <[email protected]> Reported-by: Ido Schimmel <[email protected]> Reported-by: Guenter Roeck <[email protected]> Signed-off-by: Herbert Xu <[email protected]> Tested-by: Ido Schimmel <[email protected]> Tested-by: Geert Uytterhoeven <[email protected]> Signed-off-by: Herbert Xu <[email protected]>
2021-11-08lib: zstd: Add cast to silence clang's -Wbitwise-instead-of-logicalNathan Chancellor1-1/+1
A new warning in clang warns that there is an instance where boolean expressions are being used with bitwise operators instead of logical ones: lib/zstd/decompress/huf_decompress.c:890:25: warning: use of bitwise '&' with boolean operands [-Wbitwise-instead-of-logical] (BIT_reloadDStreamFast(&bitD1) == BIT_DStream_unfinished) ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ zstd does this frequently to help with performance, as logical operators have branches whereas bitwise ones do not. To fix this warning in other cases, the expressions were placed on separate lines with the '&=' operator; however, this particular instance was moved away from that so that it could be surrounded by LIKELY, which is a macro for __builtin_expect(), to help with a performance regression, according to upstream zstd pull #1973. Aside from switching to logical operators, which is likely undesirable in this instance, or disabling the warning outright, the solution is casting one of the expressions to an integer type to make it clear to clang that the author knows what they are doing. Add a cast to U32 to silence the warning. The first U32 cast is to silence an instance of -Wshorten-64-to-32 because __builtin_expect() returns long so it cannot be moved. Link: https://github.com/ClangBuiltLinux/linux/issues/1486 Link: https://github.com/facebook/zstd/pull/1973 Reported-by: Nick Desaulniers <[email protected]> Signed-off-by: Nathan Chancellor <[email protected]> Signed-off-by: Nick Terrell <[email protected]>
2021-11-08MAINTAINERS: Add maintainer entry for zstdNick Terrell1-0/+12
Adds a maintainer entry for zstd listing myself as the maintainer for all zstd code, pointing to the upstream issues tracker for bugs, and listing my linux repo as the tree. Signed-off-by: Nick Terrell <[email protected]>
2021-11-08lib: zstd: Upgrade to latest upstream zstd version 1.4.10Nick Terrell68-12993/+26861
Upgrade to the latest upstream zstd version 1.4.10. This patch is 100% generated from upstream zstd commit 20821a46f412 [0]. This patch is very large because it is transitioning from the custom kernel zstd to using upstream directly. The new zstd follows upstreams file structure which is different. Future update patches will be much smaller because they will only contain the changes from one upstream zstd release. As an aid for review I've created a commit [1] that shows the diff between upstream zstd as-is (which doesn't compile), and the zstd code imported in this patch. The verion of zstd in this patch is generated from upstream with changes applied by automation to replace upstreams libc dependencies, remove unnecessary portability macros, replace `/**` comments with `/*` comments, and use the kernel's xxhash instead of bundling it. The benefits of this patch are as follows: 1. Using upstream directly with automated script to generate kernel code. This allows us to update the kernel every upstream release, so the kernel gets the latest bug fixes and performance improvements, and doesn't get 3 years out of date again. The automation and the translated code are tested every upstream commit to ensure it continues to work. 2. Upgrades from a custom zstd based on 1.3.1 to 1.4.10, getting 3 years of performance improvements and bug fixes. On x86_64 I've measured 15% faster BtrFS and SquashFS decompression+read speeds, 35% faster kernel decompression, and 30% faster ZRAM decompression+read speeds. 3. Zstd-1.4.10 supports negative compression levels, which allow zstd to match or subsume lzo's performance. 4. Maintains the same kernel-specific wrapper API, so no callers have to be modified with zstd version updates. One concern that was brought up was stack usage. Upstream zstd had already removed most of its heavy stack usage functions, but I just removed the last functions that allocate arrays on the stack. I've measured the high water mark for both compression and decompression before and after this patch. Decompression is approximately neutral, using about 1.2KB of stack space. Compression levels up to 3 regressed from 1.4KB -> 1.6KB, and higher compression levels regressed from 1.5KB -> 2KB. We've added unit tests upstream to prevent further regression. I believe that this is a reasonable increase, and if it does end up causing problems, this commit can be cleanly reverted, because it only touches zstd. I chose the bulk update instead of replaying upstream commits because there have been ~3500 upstream commits since the 1.3.1 release, zstd wasn't ready to be used in the kernel as-is before a month ago, and not all upstream zstd commits build. The bulk update preserves bisectablity because bugs can be bisected to the zstd version update. At that point the update can be reverted, and we can work with upstream to find and fix the bug. Note that upstream zstd release 1.4.10 doesn't exist yet. I have cut a staging branch at 20821a46f412 [0] and will apply any changes requested to the staging branch. Once we're ready to merge this update I will cut a zstd release at the commit we merge, so we have a known zstd release in the kernel. The implementation of the kernel API is contained in zstd_compress_module.c and zstd_decompress_module.c. [0] https://github.com/facebook/zstd/commit/20821a46f4122f9abd7c7b245d28162dde8129c9 [1] https://github.com/terrelln/linux/commit/e0fa481d0e3df26918da0a13749740a1f6777574 Signed-off-by: Nick Terrell <[email protected]> Tested By: Paul Jones <[email protected]> Tested-by: Oleksandr Natalenko <[email protected]> Tested-by: Sedat Dilek <[email protected]> # LLVM/Clang v13.0.0 on x86-64 Tested-by: Jean-Denis Girard <[email protected]>
2021-11-08lib: zstd: Add decompress_sources.h for decompress_unzstdNick Terrell2-5/+24
Adds decompress_sources.h which includes every .c file necessary for zstd decompression. This is used in decompress_unzstd.c so the internal structure of the library isn't exposed. This allows us to upgrade the zstd library version without modifying any callers. Instead we just need to update decompress_sources.h. Signed-off-by: Nick Terrell <[email protected]> Tested By: Paul Jones <[email protected]> Tested-by: Oleksandr Natalenko <[email protected]> Tested-by: Sedat Dilek <[email protected]> # LLVM/Clang v13.0.0 on x86-64 Tested-by: Jean-Denis Girard <[email protected]>
2021-11-08lib: zstd: Add kernel-specific APINick Terrell11-1158/+1691
This patch: - Moves `include/linux/zstd.h` -> `include/linux/zstd_lib.h` - Updates modified zstd headers to yearless copyright - Adds a new API in `include/linux/zstd.h` that is functionally equivalent to the in-use subset of the current API. Functions are renamed to avoid symbol collisions with zstd, to make it clear it is not the upstream zstd API, and to follow the kernel style guide. - Updates all callers to use the new API. There are no functional changes in this patch. Since there are no functional change, I felt it was okay to update all the callers in a single patch. Once the API is approved, the callers are mechanically changed. This patch is preparing for the 3rd patch in this series, which updates zstd to version 1.4.10. Since the upstream zstd API is no longer exposed to callers, the update can happen transparently. Signed-off-by: Nick Terrell <[email protected]> Tested By: Paul Jones <[email protected]> Tested-by: Oleksandr Natalenko <[email protected]> Tested-by: Sedat Dilek <[email protected]> # LLVM/Clang v13.0.0 on x86-64 Tested-by: Jean-Denis Girard <[email protected]>
2021-11-09bpf, sockmap: sk_skb data_end access incorrect when src_reg = dst_regJussi Maki2-6/+34
The current conversion of skb->data_end reads like this: ; data_end = (void*)(long)skb->data_end; 559: (79) r1 = *(u64 *)(r2 +200) ; r1 = skb->data 560: (61) r11 = *(u32 *)(r2 +112) ; r11 = skb->len 561: (0f) r1 += r11 562: (61) r11 = *(u32 *)(r2 +116) 563: (1f) r1 -= r11 But similar to the case in 84f44df664e9 ("bpf: sock_ops sk access may stomp registers when dst_reg = src_reg"), the code will read an incorrect skb->len when src == dst. In this case we end up generating this xlated code: ; data_end = (void*)(long)skb->data_end; 559: (79) r1 = *(u64 *)(r1 +200) ; r1 = skb->data 560: (61) r11 = *(u32 *)(r1 +112) ; r11 = (skb->data)->len 561: (0f) r1 += r11 562: (61) r11 = *(u32 *)(r1 +116) 563: (1f) r1 -= r11 ... where line 560 is the reading 4B of (skb->data + 112) instead of the intended skb->len Here the skb pointer in r1 gets set to skb->data and the later deref for skb->len ends up following skb->data instead of skb. This fixes the issue similarly to the patch mentioned above by creating an additional temporary variable and using to store the register when dst_reg = src_reg. We name the variable bpf_temp_reg and place it in the cb context for sk_skb. Then we restore from the temp to ensure nothing is lost. Fixes: 16137b09a66f2 ("bpf: Compute data_end dynamically with JIT code") Signed-off-by: Jussi Maki <[email protected]> Signed-off-by: John Fastabend <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Reviewed-by: Jakub Sitnicki <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]