aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2018-04-11mm/swapfile.c: make pointer swap_avail_heads staticColin Ian King1-1/+1
The pointer swap_avail_heads is local to the source and does not need to be in global scope, so make it static. Cleans up sparse warning: mm/swapfile.c:88:19: warning: symbol 'swap_avail_heads' was not declared. Should it be static? Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Colin Ian King <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Acked-by: "Huang, Ying" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11memcg: fix per_node_info cleanupMichal Hocko1-0/+3
syzbot has triggered a NULL ptr dereference when allocation fault injection enforces a failure and alloc_mem_cgroup_per_node_info initializes memcg->nodeinfo only half way through. But __mem_cgroup_free still tries to free all per-node data and dereferences pn->lruvec_stat_cpu unconditioanlly even if the specific per-node data hasn't been initialized. The bug is quite unlikely to hit because small allocations do not fail and we would need quite some numa nodes to make struct mem_cgroup_per_node large enough to cross the costly order. Link: http://lkml.kernel.org/r/[email protected] Reported-by: [email protected] Fixes: 00f3ca2c2d66 ("mm: memcontrol: per-lruvec stats infrastructure") Signed-off-by: Michal Hocko <[email protected]> Reviewed-by: Andrey Ryabinin <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11swap: divide-by-zero when zero length swap file on ssdTom Abraham1-0/+4
Calling swapon() on a zero length swap file on SSD can lead to a divide-by-zero. Although creating such files isn't possible with mkswap and they woud be considered invalid, it would be better for the swapon code to be more robust and handle this condition gracefully (return -EINVAL). Especially since the fix is small and straightforward. To help with wear leveling on SSD, the swapon syscall calculates a random position in the swap file using modulo p->highest_bit, which is set to maxpages - 1 in read_swap_header. If the swap file is zero length, read_swap_header sets maxpages=1 and last_page=0, resulting in p->highest_bit=0 and we divide-by-zero when we modulo p->highest_bit in swapon syscall. This can be prevented by having read_swap_header return zero if last_page is zero. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Abraham <[email protected]> Reported-by: <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Randy Dunlap <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11mm: memcg: make sure memory.events is uptodate when waking pollersJohannes Weiner3-30/+35
Commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat reporting") added per-cpu drift to all memory cgroup stats and events shown in memory.stat and memory.events. For memory.stat this is acceptable. But memory.events issues file notifications, and somebody polling the file for changes will be confused when the counters in it are unchanged after a wakeup. Luckily, the events in memory.events - MEMCG_LOW, MEMCG_HIGH, MEMCG_MAX, MEMCG_OOM - are sufficiently rare and high-level that we don't need per-cpu buffering for them: MEMCG_HIGH and MEMCG_MAX would be the most frequent, but they're counting invocations of reclaim, which is a complex operation that touches many shared cachelines. This splits memory.events from the generic VM events and tracks them in their own, unbuffered atomic counters. That's also cleaner, as it eliminates the ugly enum nesting of VM and cgroup events. [[email protected]: "array subscript is above array bounds"] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Fixes: a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat reporting") Signed-off-by: Johannes Weiner <[email protected]> Reported-by: Tejun Heo <[email protected]> Acked-by: Tejun Heo <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11mm/ksm.c: fix inconsistent accounting of zero pagesClaudio Imbrenda1-0/+7
When using KSM with use_zero_pages, we replace anonymous pages containing only zeroes with actual zero pages, which are not anonymous. We need to do proper accounting of the mm counters, otherwise we will get wrong values in /proc and a BUG message in dmesg when tearing down the mm. Link: http://lkml.kernel.org/r/[email protected] Fixes: e86c59b1b1 ("mm/ksm: improve deduplication of zero pages with colouring") Signed-off-by: Claudio Imbrenda <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11mm/z3fold.c: use gfpflags_allow_blockingMatthew Wilcox1-1/+1
We have a perfectly good macro to determine whether the gfp flags allow you to sleep or not; use it instead of trying to infer it. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11z3fold: fix memory leakXidong Wang1-2/+7
In z3fold_create_pool(), the memory allocated by __alloc_percpu() is not released on the error path that pool->compact_wq , which holds the return value of create_singlethread_workqueue(), is NULL. This will result in a memory leak bug. [[email protected]: fix oops on kzalloc() failure, check __alloc_percpu() retval] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Xidong Wang <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Vitaly Wool <[email protected]> Cc: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11memcg, thp: do not invoke oom killer on thp chargesMichal Hocko3-10/+5
A THP memcg charge can trigger the oom killer since 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations"). We have used an explicit __GFP_NORETRY previously which ruled the OOM killer automagically. Memcg charge path should be semantically compliant with the allocation path and that means that if we do not trigger the OOM killer for costly orders which should do the same in the memcg charge path as well. Otherwise we are forcing callers to distinguish the two and use different gfp masks which is both non-intuitive and bug prone. As soon as we get a costly high order kmalloc user we even do not have any means to tell the memcg specific gfp mask to prevent from OOM because the charging is deep within guts of the slab allocator. The unexpected memcg OOM on THP has already been fixed upstream by 9d3c3354bb85 ("mm, thp: do not cause memcg oom for thp") but this is a one-off fix rather than a generic solution. Teach mem_cgroup_oom to bail out on costly order requests to fix the THP issue as well as any other costly OOM eligible allocations to be added in future. Also revert 9d3c3354bb85 because special gfp for THP is no longer needed. Link: http://lkml.kernel.org/r/[email protected] Fixes: 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations") Signed-off-by: Michal Hocko <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11mm/migrate: properly preserve write attribute in special migrate entryRalph Campbell1-1/+2
Use of pte_write(pte) is only valid for present pte, the common code which set the migration entry can be reach for both valid present pte and special swap entry (for device memory). Fix the code to use the mpfn value which properly handle both cases. On x86 this did not have any bad side effect because pte write bit is below PAGE_BIT_GLOBAL and thus special swap entry have it set to 0 which in turn means we were always creating read only special migration entry. So once migration did finish we always write protected the CPU page table entry (moreover this is only an issue when migrating from device memory to system memory). End effect is that CPU write access would fault again and restore write permission. This behaviour isn't too bad; it just burns CPU cycles by forcing CPU to take a second fault on write access. ie, double faulting the same address. There is no corruption or incorrect states (it behaves as a COWed page from a fork with a mapcount of 1). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ralph Campbell <[email protected]> Signed-off-by: Jérôme Glisse <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11mm: check __highest_present_section_nr directly in memory_dev_init()Wei Yang1-5/+2
__highest_present_section_nr is a more strict boundary than NR_MEM_SECTIONS. So checking __highest_present_section_nr directly is enough. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Wei Yang <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Michal Hocko <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11sched/numa: avoid trapping faults and attempting migration of file-backed ↵Mel Gorman2-0/+16
dirty pages change_pte_range is called from task work context to mark PTEs for receiving NUMA faulting hints. If the marked pages are dirty then migration may fail. Some filesystems cannot migrate dirty pages without blocking so are skipped in MIGRATE_ASYNC mode which just wastes CPU. Even when they can, it can be a waste of cycles when the pages are shared forcing higher scan rates. This patch avoids marking shared dirty pages for hinting faults but also will skip a migration if the page was dirtied after the scanner updated a clean page. This is most noticeable running the NASA Parallel Benchmark when backed by btrfs, the default root filesystem for some distributions, but also noticeable when using XFS. The following are results from a 4-socket machine running a 4.16-rc4 kernel with some scheduler patches that are pending for the next merge window. 4.16.0-rc4 4.16.0-rc4 schedtip-20180309 nodirty-v1 Time cg.D 459.07 ( 0.00%) 444.21 ( 3.24%) Time ep.D 76.96 ( 0.00%) 77.69 ( -0.95%) Time is.D 25.55 ( 0.00%) 27.85 ( -9.00%) Time lu.D 601.58 ( 0.00%) 596.87 ( 0.78%) Time mg.D 107.73 ( 0.00%) 108.22 ( -0.45%) is.D regresses slightly in terms of absolute time but note that that particular load varies quite a bit from run to run. The more relevant observation is the total system CPU usage. 4.16.0-rc4 4.16.0-rc4 schedtip-20180309 nodirty-v1 User 71471.91 70627.04 System 11078.96 8256.13 Elapsed 661.66 632.74 That is a substantial drop in system CPU usage and overall the workload completes faster. The NUMA balancing statistics are also interesting NUMA base PTE updates 111407972 139848884 NUMA huge PMD updates 206506 264869 NUMA page range updates 217139044 275461812 NUMA hint faults 4300924 3719784 NUMA hint local faults 3012539 3416618 NUMA hint local percent 70 91 NUMA pages migrated 1517487 1358420 While more PTEs are scanned due to changes in what faults are gathered, it's clear that a far higher percentage of faults are local as the bulk of the remote hits were dirty pages that, in this case with btrfs, had no chance of migrating. The following is a comparison when using XFS as that is a more realistic filesystem choice for a data partition 4.16.0-rc4 4.16.0-rc4 schedtip-20180309 nodirty-v1r47 Time cg.D 485.28 ( 0.00%) 442.62 ( 8.79%) Time ep.D 77.68 ( 0.00%) 77.54 ( 0.18%) Time is.D 26.44 ( 0.00%) 24.79 ( 6.24%) Time lu.D 597.46 ( 0.00%) 597.11 ( 0.06%) Time mg.D 142.65 ( 0.00%) 105.83 ( 25.81%) That is a reasonable gain on two relatively long-lived workloads. While not presented, there is also a substantial drop in system CPu usage and the NUMA balancing stats show similar improvements in locality as btrfs did. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11Documentation/vm/hmm.txt: typos and syntaxes fixesJérôme Glisse1-54/+54
This fix typos and syntaxes, thanks to Randy Dunlap for pointing them out (they were all my faults). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Reviewed-by: Ralph Campbell <[email protected]> Reviewed-by: Randy Dunlap <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11mm/hmm: fix header file if/else/endif maze, againArnd Bergmann1-9/+12
The last fix was still wrong, as we need the inline dummy functions also for the case that CONFIG_HMM is enabled but CONFIG_HMM_MIRROR is not: kernel/fork.o: In function `__mmdrop': fork.c:(.text+0x14f6): undefined reference to `hmm_mm_destroy' This adds back the second copy of the dummy functions, hopefully this time in the right place. Link: http://lkml.kernel.org/r/[email protected] Fixes: 8900d06a277a ("mm/hmm: fix header file if/else/endif maze") Signed-off-by: Arnd Bergmann <[email protected]> Reviewed-by: Jérôme Glisse <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11mm/hmm.c: remove superfluous RCU protection around radix tree lookupTejun Heo1-10/+2
hmm_devmem_find() requires rcu_read_lock_held() but there's nothing which actually uses the RCU protection. The only caller is hmm_devmem_pages_create() which already grabs the mutex and does superfluous rcu_read_lock/unlock() around the function. This doesn't add anything and just adds to confusion. Remove the RCU protection and open-code the radix tree lookup. If this needs to become more sophisticated in the future, let's add them back when necessary. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Tejun Heo <[email protected]> Reviewed-by: Jérôme Glisse <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Benjamin LaHaise <[email protected]> Cc: Al Viro <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11mm/hmm: use device driver encoding for HMM pfnJérôme Glisse2-77/+152
Users of hmm_vma_fault() and hmm_vma_get_pfns() provide a flags array and pfn shift value allowing them to define their own encoding for HMM pfn that are fill inside the pfns array of the hmm_range struct. With this device driver can get pfn that match their own private encoding out of HMM without having to do any conversion. [[email protected]: don't ignore specific pte fault flag in hmm_vma_fault()] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: clarify fault logic for device private memory] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Signed-off-by: Ralph Campbell <[email protected]> Cc: Evgeny Baskakov <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Mark Hairgrove <[email protected]> Cc: John Hubbard <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11mm/hmm: change hmm_vma_fault() to allow write fault on page basisJérôme Glisse2-34/+119
This changes hmm_vma_fault() to not take a global write fault flag for a range but instead rely on caller to populate HMM pfns array with proper fault flag ie HMM_PFN_VALID if driver want read fault for that address or HMM_PFN_VALID and HMM_PFN_WRITE for write. Moreover by setting HMM_PFN_DEVICE_PRIVATE the device driver can ask for device private memory to be migrated back to system memory through page fault. This is more flexible API and it better reflects how device handles and reports fault. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Cc: Evgeny Baskakov <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Mark Hairgrove <[email protected]> Cc: John Hubbard <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11mm/hmm: factor out pte and pmd handling to simplify hmm_vma_walk_pmd()Jérôme Glisse1-72/+102
No functional change, just create one function to handle pmd and one to handle pte (hmm_vma_handle_pmd() and hmm_vma_handle_pte()). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Reviewed-by: John Hubbard <[email protected]> Cc: Evgeny Baskakov <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Mark Hairgrove <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11mm/hmm: move hmm_pfns_clear() closer to where it is usedJérôme Glisse1-8/+8
Move hmm_pfns_clear() closer to where it is used to make it clear it is not use by page table walkers. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Reviewed-by: John Hubbard <[email protected]> Cc: Evgeny Baskakov <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Mark Hairgrove <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11mm/hmm: rename HMM_PFN_DEVICE_UNADDRESSABLE to HMM_PFN_DEVICE_PRIVATEJérôme Glisse2-3/+3
Make naming consistent across code, DEVICE_PRIVATE is the name use outside HMM code so use that one. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Reviewed-by: John Hubbard <[email protected]> Cc: Evgeny Baskakov <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Mark Hairgrove <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11mm/hmm: do not differentiate between empty entry or missing directoryJérôme Glisse2-35/+18
There is no point in differentiating between a range for which there is not even a directory (and thus entries) and empty entry (pte_none() or pmd_none() returns true). Simply drop the distinction ie remove HMM_PFN_EMPTY flag and merge now duplicate hmm_vma_walk_hole() and hmm_vma_walk_clear() functions. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Reviewed-by: John Hubbard <[email protected]> Cc: Evgeny Baskakov <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Mark Hairgrove <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11mm/hmm: cleanup special vma handling (VM_SPECIAL)Jérôme Glisse1-20/+20
Special vma (one with any of the VM_SPECIAL flags) can not be access by device because there is no consistent model across device drivers on those vma and their backing memory. This patch directly use hmm_range struct for hmm_pfns_special() argument as it is always affecting the whole vma and thus the whole range. It also make behavior consistent after this patch both hmm_vma_fault() and hmm_vma_get_pfns() returns -EINVAL when facing such vma. Previously hmm_vma_fault() returned 0 and hmm_vma_get_pfns() return -EINVAL but both were filling the HMM pfn array with special entry. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Reviewed-by: John Hubbard <[email protected]> Cc: Evgeny Baskakov <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Mark Hairgrove <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11mm/hmm: use uint64_t for HMM pfn instead of defining hmm_pfn_t to ulongJérôme Glisse2-38/+34
All device driver we care about are using 64bits page table entry. In order to match this and to avoid useless define convert all HMM pfn to directly use uint64_t. It is a first step on the road to allow driver to directly use pfn value return by HMM (saving memory and CPU cycles use for conversion between the two). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Reviewed-by: John Hubbard <[email protected]> Cc: Evgeny Baskakov <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Mark Hairgrove <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11mm/hmm: remove HMM_PFN_READ flag and ignore peculiar architectureJérôme Glisse2-19/+41
Only peculiar architecture allow write without read thus assume that any valid pfn do allow for read. Note we do not care for write only because it does make sense with thing like atomic compare and exchange or any other operations that allow you to get the memory value through them. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Reviewed-by: John Hubbard <[email protected]> Cc: Evgeny Baskakov <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Mark Hairgrove <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11mm/hmm: use struct for hmm_vma_fault(), hmm_vma_get_pfns() parametersJérôme Glisse2-63/+33
Both hmm_vma_fault() and hmm_vma_get_pfns() were taking a hmm_range struct as parameter and were initializing that struct with others of their parameters. Have caller of those function do this as they are likely to already do and only pass this struct to both function this shorten function signature and make it easier in the future to add new parameters by simply adding them to the structure. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Reviewed-by: John Hubbard <[email protected]> Cc: Evgeny Baskakov <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Mark Hairgrove <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11mm/hmm: hmm_pfns_bad() was accessing wrong structJérôme Glisse1-1/+2
The private field of mm_walk struct point to an hmm_vma_walk struct and not to the hmm_range struct desired. Fix to get proper struct pointer. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Cc: Evgeny Baskakov <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Mark Hairgrove <[email protected]> Cc: John Hubbard <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11mm/hmm: unregister mmu_notifier when last HMM client quitJérôme Glisse1-3/+35
This code was lost in translation at one point. This properly call mmu_notifier_unregister_no_release() once last user is gone. This fix the zombie mm_struct as without this patch we do not drop the refcount we have on it. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Cc: Evgeny Baskakov <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Mark Hairgrove <[email protected]> Cc: John Hubbard <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11mm/hmm: HMM should have a callback before MM is destroyedRalph Campbell2-1/+38
hmm_mirror_register() registers a callback for when the CPU pagetable is modified. Normally, the device driver will call hmm_mirror_unregister() when the process using the device is finished. However, if the process exits uncleanly, the struct_mm can be destroyed with no warning to the device driver. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ralph Campbell <[email protected]> Signed-off-by: Jérôme Glisse <[email protected]> Reviewed-by: John Hubbard <[email protected]> Cc: Evgeny Baskakov <[email protected]> Cc: Mark Hairgrove <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11mm/hmm: fix header file if/else/endif mazeJérôme Glisse1-8/+1
The #if/#else/#endif for IS_ENABLED(CONFIG_HMM) were wrong. Because of this after multiple include there was multiple definition of both hmm_mm_init() and hmm_mm_destroy() leading to build failure if HMM was enabled (CONFIG_HMM set). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Acked-by: Balbir Singh <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: John Hubbard <[email protected]> Cc: Evgeny Baskakov <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11mm/hmm: documentation editorial update to HMM documentationRalph Campbell2-174/+187
Update the documentation for HMM to fix minor typos and phrasing to be a bit more readable. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ralph Campbell <[email protected]> Signed-off-by: Jérôme Glisse <[email protected]> Cc: Stephen Bates <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Logan Gunthorpe <[email protected]> Cc: Evgeny Baskakov <[email protected]> Cc: Mark Hairgrove <[email protected]> Cc: John Hubbard <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11mm, vmscan, tracing: use pointer to reclaim_stat struct in trace eventSteven Rostedt3-32/+21
The trace event trace_mm_vmscan_lru_shrink_inactive() currently has 12 parameters! Seven of them are from the reclaim_stat structure. This structure is currently local to mm/vmscan.c. By moving it to the global vmstat.h header, we can also reference it from the vmscan tracepoints. In moving it, it brings down the overhead of passing so many arguments to the trace event. In the future, we may limit the number of arguments that a trace event may pass (ideally just 6, but more realistically it may be 8). Before this patch, the code to call the trace event is this: 0f 83 aa fe ff ff jae ffffffff811e6261 <shrink_inactive_list+0x1e1> 48 8b 45 a0 mov -0x60(%rbp),%rax 45 8b 64 24 20 mov 0x20(%r12),%r12d 44 8b 6d d4 mov -0x2c(%rbp),%r13d 8b 4d d0 mov -0x30(%rbp),%ecx 44 8b 75 cc mov -0x34(%rbp),%r14d 44 8b 7d c8 mov -0x38(%rbp),%r15d 48 89 45 90 mov %rax,-0x70(%rbp) 8b 83 b8 fe ff ff mov -0x148(%rbx),%eax 8b 55 c0 mov -0x40(%rbp),%edx 8b 7d c4 mov -0x3c(%rbp),%edi 8b 75 b8 mov -0x48(%rbp),%esi 89 45 80 mov %eax,-0x80(%rbp) 65 ff 05 e4 f7 e2 7e incl %gs:0x7ee2f7e4(%rip) # 15bd0 <__preempt_count> 48 8b 05 75 5b 13 01 mov 0x1135b75(%rip),%rax # ffffffff8231bf68 <__tracepoint_mm_vmscan_lru_shrink_inactive+0x28> 48 85 c0 test %rax,%rax 74 72 je ffffffff811e646a <shrink_inactive_list+0x3ea> 48 89 c3 mov %rax,%rbx 4c 8b 10 mov (%rax),%r10 89 f8 mov %edi,%eax 48 89 85 68 ff ff ff mov %rax,-0x98(%rbp) 89 f0 mov %esi,%eax 48 89 85 60 ff ff ff mov %rax,-0xa0(%rbp) 89 c8 mov %ecx,%eax 48 89 85 78 ff ff ff mov %rax,-0x88(%rbp) 89 d0 mov %edx,%eax 48 89 85 70 ff ff ff mov %rax,-0x90(%rbp) 8b 45 8c mov -0x74(%rbp),%eax 48 8b 7b 08 mov 0x8(%rbx),%rdi 48 83 c3 18 add $0x18,%rbx 50 push %rax 41 54 push %r12 41 55 push %r13 ff b5 78 ff ff ff pushq -0x88(%rbp) 41 56 push %r14 41 57 push %r15 ff b5 70 ff ff ff pushq -0x90(%rbp) 4c 8b 8d 68 ff ff ff mov -0x98(%rbp),%r9 4c 8b 85 60 ff ff ff mov -0xa0(%rbp),%r8 48 8b 4d 98 mov -0x68(%rbp),%rcx 48 8b 55 90 mov -0x70(%rbp),%rdx 8b 75 80 mov -0x80(%rbp),%esi 41 ff d2 callq *%r10 After the patch: 0f 83 a8 fe ff ff jae ffffffff811e626d <shrink_inactive_list+0x1cd> 8b 9b b8 fe ff ff mov -0x148(%rbx),%ebx 45 8b 64 24 20 mov 0x20(%r12),%r12d 4c 8b 6d a0 mov -0x60(%rbp),%r13 65 ff 05 f5 f7 e2 7e incl %gs:0x7ee2f7f5(%rip) # 15bd0 <__preempt_count> 4c 8b 35 86 5b 13 01 mov 0x1135b86(%rip),%r14 # ffffffff8231bf68 <__tracepoint_mm_vmscan_lru_shrink_inactive+0x28> 4d 85 f6 test %r14,%r14 74 2a je ffffffff811e6411 <shrink_inactive_list+0x371> 49 8b 06 mov (%r14),%rax 8b 4d 8c mov -0x74(%rbp),%ecx 49 8b 7e 08 mov 0x8(%r14),%rdi 49 83 c6 18 add $0x18,%r14 4c 89 ea mov %r13,%rdx 45 89 e1 mov %r12d,%r9d 4c 8d 45 b8 lea -0x48(%rbp),%r8 89 de mov %ebx,%esi 51 push %rcx 48 8b 4d 98 mov -0x68(%rbp),%rcx ff d0 callq *%rax Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Steven Rostedt (VMware) <[email protected]> Reported-by: Alexei Starovoitov <[email protected]> Acked-by: David Rientjes <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Alexei Starovoitov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11mm/vmscan: don't mess with pgdat->flags in memcg reclaimAndrey Ryabinin4-38/+82
memcg reclaim may alter pgdat->flags based on the state of LRU lists in cgroup and its children. PGDAT_WRITEBACK may force kswapd to sleep congested_wait(), PGDAT_DIRTY may force kswapd to writeback filesystem pages. But the worst here is PGDAT_CONGESTED, since it may force all direct reclaims to stall in wait_iff_congested(). Note that only kswapd have powers to clear any of these bits. This might just never happen if cgroup limits configured that way. So all direct reclaims will stall as long as we have some congested bdi in the system. Leave all pgdat->flags manipulations to kswapd. kswapd scans the whole pgdat, only kswapd can clear pgdat->flags once node is balanced, thus it's reasonable to leave all decisions about node state to kswapd. Why only kswapd? Why not allow to global direct reclaim change these flags? It is because currently only kswapd can clear these flags. I'm less worried about the case when PGDAT_CONGESTED falsely not set, and more worried about the case when it falsely set. If direct reclaimer sets PGDAT_CONGESTED, do we have guarantee that after the congestion problem is sorted out, kswapd will be woken up and clear the flag? It seems like there is no such guarantee. E.g. direct reclaimers may eventually balance pgdat and kswapd simply won't wake up (see wakeup_kswapd()). Moving pgdat->flags manipulation to kswapd, means that cgroup2 recalim now loses its congestion throttling mechanism. Add per-cgroup congestion state and throttle cgroup2 reclaimers if memcg is in congestion state. Currently there is no need in per-cgroup PGDAT_WRITEBACK and PGDAT_DIRTY bits since they alter only kswapd behavior. The problem could be easily demonstrated by creating heavy congestion in one cgroup: echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control mkdir -p /sys/fs/cgroup/congester echo 512M > /sys/fs/cgroup/congester/memory.max echo $$ > /sys/fs/cgroup/congester/cgroup.procs /* generate a lot of diry data on slow HDD */ while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done & .... while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done & and some job in another cgroup: mkdir /sys/fs/cgroup/victim echo 128M > /sys/fs/cgroup/victim/memory.max # time cat /dev/sda > /dev/null real 10m15.054s user 0m0.487s sys 1m8.505s According to the tracepoint in wait_iff_congested(), the 'cat' spent 50% of the time sleeping there. With the patch, cat don't waste time anymore: # time cat /dev/sda > /dev/null real 5m32.911s user 0m0.411s sys 0m56.664s [[email protected]: congestion state should be per-node] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: make congestion state per-cgroup-per-node instead of just per-cgroup[ Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrey Ryabinin <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Steven Rostedt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11mm/vmscan: don't change pgdat state on base of a single LRU list stateAndrey Ryabinin1-51/+75
We have separate LRU list for each memory cgroup. Memory reclaim iterates over cgroups and calls shrink_inactive_list() every inactive LRU list. Based on the state of a single LRU shrink_inactive_list() may flag the whole node as dirty,congested or under writeback. This is obviously wrong and hurtful. It's especially hurtful when we have possibly small congested cgroup in system. Than *all* direct reclaims waste time by sleeping in wait_iff_congested(). And the more memcgs in the system we have the longer memory allocation stall is, because wait_iff_congested() called on each lru-list scan. Sum reclaim stats across all visited LRUs on node and flag node as dirty, congested or under writeback based on that sum. Also call congestion_wait(), wait_iff_congested() once per pgdat scan, instead of once per lru-list scan. This only fixes the problem for global reclaim case. Per-cgroup reclaim may alter global pgdat flags too, which is wrong. But that is separate issue and will be addressed in the next patch. This change will not have any effect on a systems with all workload concentrated in a single cgroup. [[email protected]: check nr_writeback against all nr_taken, not just file] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrey Ryabinin <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Steven Rostedt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11mm/vmscan: remove redundant current_may_throttle() checkAndrey Ryabinin1-1/+1
Only kswapd can have non-zero nr_immediate, and current_may_throttle() is always true for kswapd (PF_LESS_THROTTLE bit is never set) thus it's enough to check stat.nr_immediate only. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrey Ryabinin <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11mm/vmscan: update stale commentsAndrey Ryabinin1-5/+5
Update some comments that became stale since transiton from per-zone to per-node reclaim. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrey Ryabinin <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11mm: treat indirectly reclaimable memory as free in overcommit logicRoman Gushchin1-0/+7
Indirectly reclaimable memory can consume a significant part of total memory and it's actually reclaimable (it will be released under actual memory pressure). So, the overcommit logic should treat it as free. Otherwise, it's possible to cause random system-wide memory allocation failures by consuming a significant amount of memory by indirectly reclaimable memory, e.g. dentry external names. If overcommit policy GUESS is used, it might be used for denial of service attack under some conditions. The following program illustrates the approach. It causes the kernel to allocate an unreclaimable kmalloc-256 chunk for each stat() call, so that at some point the overcommit logic may start blocking large allocation system-wide. int main() { char buf[256]; unsigned long i; struct stat statbuf; buf[0] = '/'; for (i = 1; i < sizeof(buf); i++) buf[i] = '_'; for (i = 0; 1; i++) { sprintf(&buf[248], "%8lu", i); stat(buf, &statbuf); } return 0; } This patch in combination with related indirectly reclaimable memory patches closes this issue. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11dcache: account external names as indirectly reclaimable memoryRoman Gushchin1-9/+30
I received a report about suspicious growth of unreclaimable slabs on some machines. I've found that it happens on machines with low memory pressure, and these unreclaimable slabs are external names attached to dentries. External names are allocated using generic kmalloc() function, so they are accounted as unreclaimable. But they are held by dentries, which are reclaimable, and they will be reclaimed under the memory pressure. In particular, this breaks MemAvailable calculation, as it doesn't take unreclaimable slabs into account. This leads to a silly situation, when a machine is almost idle, has no memory pressure and therefore has a big dentry cache. And the resulting MemAvailable is too low to start a new workload. To address the issue, the NR_INDIRECTLY_RECLAIMABLE_BYTES counter is used to track the amount of memory, consumed by external names. The counter is increased in the dentry allocation path, if an external name structure is allocated; and it's decreased in the dentry freeing path. To reproduce the problem I've used the following Python script: import os for iter in range (0, 10000000): try: name = ("/some_long_name_%d" % iter) + "_" * 220 os.stat(name) except Exception: pass Without this patch: $ cat /proc/meminfo | grep MemAvailable MemAvailable: 7811688 kB $ python indirect.py $ cat /proc/meminfo | grep MemAvailable MemAvailable: 2753052 kB With the patch: $ cat /proc/meminfo | grep MemAvailable MemAvailable: 7809516 kB $ python indirect.py $ cat /proc/meminfo | grep MemAvailable MemAvailable: 7749144 kB [[email protected]: fix indirectly reclaimable memory accounting for CONFIG_SLOB] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: fix indirectly reclaimable memory accounting] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11mm: treat indirectly reclaimable memory as available in MemAvailableRoman Gushchin1-0/+7
Adjust /proc/meminfo MemAvailable calculation by adding the amount of indirectly reclaimable memory (rounded to the PAGE_SIZE). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11mm: introduce NR_INDIRECTLY_RECLAIMABLE_BYTESRoman Gushchin2-0/+2
Patch series "indirectly reclaimable memory", v2. This patchset introduces the concept of indirectly reclaimable memory and applies it to fix the issue of when a big number of dentries with external names can significantly affect the MemAvailable value. This patch (of 3): Introduce a concept of indirectly reclaimable memory and adds the corresponding memory counter and /proc/vmstat item. Indirectly reclaimable memory is any sort of memory, used by the kernel (except of reclaimable slabs), which is actually reclaimable, i.e. will be released under memory pressure. The counter is in bytes, as it's not always possible to count such objects in pages. The name contains BYTES by analogy to NR_KERNEL_STACK_KB. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-11sr: get/drop reference to device in revalidate and check_eventsJens Axboe1-4/+15
We can't just use scsi_cd() to get the scsi_cd structure, we have to grab a live reference to the device. For both callbacks, we're not inside an open where we already hold a reference to the device. This fixes device removal/addition under concurrent device access, which otherwise could result in the below oops. NULL pointer dereference at 0000000000000010 PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP Modules linked in: sr 12:0:0:0: [sr2] scsi-1 drive scsi_debug crc_t10dif crct10dif_generic crct10dif_common nvme nvme_core sb_edac xl sr 12:0:0:0: Attached scsi CD-ROM sr2 sr_mod cdrom btrfs xor zstd_decompress zstd_compress xxhash lzo_compress zlib_defc sr 12:0:0:0: Attached scsi generic sg7 type 5 igb ahci libahci i2c_algo_bit libata dca [last unloaded: crc_t10dif] CPU: 43 PID: 4629 Comm: systemd-udevd Not tainted 4.16.0+ #650 Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 11/09/2016 RIP: 0010:sr_block_revalidate_disk+0x23/0x190 [sr_mod] RSP: 0018:ffff883ff357bb58 EFLAGS: 00010292 RAX: ffffffffa00b07d0 RBX: ffff883ff3058000 RCX: ffff883ff357bb66 RDX: 0000000000000003 RSI: 0000000000007530 RDI: ffff881fea631000 RBP: 0000000000000000 R08: ffff881fe4d38400 R09: 0000000000000000 R10: 0000000000000000 R11: 00000000000001b6 R12: 000000000800005d R13: 000000000800005d R14: ffff883ffd9b3790 R15: 0000000000000000 FS: 00007f7dc8e6d8c0(0000) GS:ffff883fff340000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000010 CR3: 0000003ffda98005 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ? __invalidate_device+0x48/0x60 check_disk_change+0x4c/0x60 sr_block_open+0x16/0xd0 [sr_mod] __blkdev_get+0xb9/0x450 ? iget5_locked+0x1c0/0x1e0 blkdev_get+0x11e/0x320 ? bdget+0x11d/0x150 ? _raw_spin_unlock+0xa/0x20 ? bd_acquire+0xc0/0xc0 do_dentry_open+0x1b0/0x320 ? inode_permission+0x24/0xc0 path_openat+0x4e6/0x1420 ? cpumask_any_but+0x1f/0x40 ? flush_tlb_mm_range+0xa0/0x120 do_filp_open+0x8c/0xf0 ? __seccomp_filter+0x28/0x230 ? _raw_spin_unlock+0xa/0x20 ? __handle_mm_fault+0x7d6/0x9b0 ? list_lru_add+0xa8/0xc0 ? _raw_spin_unlock+0xa/0x20 ? __alloc_fd+0xaf/0x160 ? do_sys_open+0x1a6/0x230 do_sys_open+0x1a6/0x230 do_syscall_64+0x5a/0x100 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Reviewed-by: Lee Duncan <[email protected]> Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-04-11s390: correct nospec auto detection init orderMartin Schwidefsky3-6/+6
With CONFIG_EXPOLINE_AUTO=y the call of spectre_v2_auto_early() via early_initcall is done *after* the early_param functions. This overwrites any settings done with the nobp/no_spectre_v2/spectre_v2 parameters. The code patching for the kernel is done after the evaluation of the early parameters but before the early_initcall is done. The end result is a kernel image that is patched correctly but the kernel modules are not. Make sure that the nospec auto detection function is called before the early parameters are evaluated and before the code patching is done. Fixes: 6e179d64126b ("s390: add automatic detection of the spectre defense") Signed-off-by: Martin Schwidefsky <[email protected]>
2018-04-11tracing: Enforce passing in filter=NULL to create_filter()Steven Rostedt (VMware)1-14/+10
There's some inconsistency with what to set the output parameter filterp when passing to create_filter(..., struct event_filter **filterp). Whatever filterp points to, should be NULL when calling this function. The create_filter() calls create_filter_start() with a pointer to a local "filter" variable that is set to NULL. The create_filter_start() has a WARN_ON() if the passed in pointer isn't pointing to a value set to NULL. Ideally, create_filter() should pass the filterp variable it received to create_filter_start() and not hide it as with a local variable, this allowed create_filter() to fail, and not update the passed in filter, and the caller of create_filter() then tried to free filter, which was never initialized to anything, causing memory corruption. Link: http://lkml.kernel.org/r/[email protected] Fixes: 80765597bc587 ("tracing: Rewrite filter logic to be simpler and faster") Reported-by: [email protected] Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-04-11trace_uprobe: Simplify probes_seq_show()Ravi Bangoria1-18/+3
Simplify probes_seq_show() function. No change in output before and after patch. Link: http://lkml.kernel.org/r/[email protected] Acked-by: Masami Hiramatsu <[email protected]> Signed-off-by: Ravi Bangoria <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-04-11trace_uprobe: Use %lx to display offsetRavi Bangoria1-1/+1
tu->offset is unsigned long, not a pointer, thus %lx should be used to print it, not the %px. Link: http://lkml.kernel.org/r/[email protected] Cc: [email protected] Acked-by: Masami Hiramatsu <[email protected]> Fixes: 0e4d819d0893 ("trace_uprobe: Display correct offset in uprobe_events") Suggested-by: Kees Cook <[email protected]> Signed-off-by: Ravi Bangoria <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-04-11tracing/uprobe: Add support for overlayfsHoward McLauchlan1-1/+1
uprobes cannot successfully attach to binaries located in a directory mounted with overlayfs. To verify, create directories for mounting overlayfs (upper,lower,work,merge), move some binary into merge/ and use readelf to obtain some known instruction of the binary. I used /bin/true and the entry instruction(0x13b0): $ mount -t overlay overlay -o lowerdir=lower,upperdir=upper,workdir=work merge $ cd /sys/kernel/debug/tracing $ echo 'p:true_entry PATH_TO_MERGE/merge/true:0x13b0' > uprobe_events $ echo 1 > events/uprobes/true_entry/enable This returns 'bash: echo: write error: Input/output error' and dmesg tells us 'event trace: Could not enable event true_entry' This change makes create_trace_uprobe() look for the real inode of a dentry. In the case of normal filesystems, this simplifies to just returning the inode. In the case of overlayfs(and similar fs) we will obtain the underlying dentry and corresponding inode, upon which uprobes can successfully register. Running the example above with the patch applied, we can see that the uprobe is enabled and will output to trace as expected. Link: http://lkml.kernel.org/r/[email protected] Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Masami Hiramatsu <[email protected]> Reviewed-by: Srikar Dronamraju <[email protected]> Signed-off-by: Howard McLauchlan <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-04-11tracing: Use ARRAY_SIZE() macro instead of open coding itJérémy Lefaure1-1/+1
It is useless to re-invent the ARRAY_SIZE macro so let's use it instead of DATA_CNT. Found with Coccinelle with the following semantic patch: @r depends on (org || report)@ type T; T[] E; position p; @@ ( (sizeof(E)@p /sizeof(*E)) | (sizeof(E)@p /sizeof(E[...])) | (sizeof(E)@p /sizeof(T)) ) Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérémy Lefaure <[email protected]> [ Removed useless include of kernel.h ] Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-04-11Merge branch 'vhost-fix-vhost_vq_access_ok-log-check'David S. Miller2-36/+38
Stefan Hajnoczi says: ==================== vhost: fix vhost_vq_access_ok() log check v3: * Rebased onto net/master and resolved conflict [DaveM] v2: * Rewrote the conditional to make the vq access check clearer [Linus] * Added Patch 2 to make the return type consistent and harder to misuse [Linus] The first patch fixes the vhost virtqueue access check which was recently broken. The second patch replaces the int return type with bool to prevent future bugs. ==================== Acked-by: Jason Wang <[email protected]> Acked-by: Michael S. Tsirkin <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-11vhost: return bool from *_access_ok() functionsStefan Hajnoczi2-35/+35
Currently vhost *_access_ok() functions return int. This is error-prone because there are two popular conventions: 1. 0 means failure, 1 means success 2. -errno means failure, 0 means success Although vhost mostly uses #1, it does not do so consistently. umem_access_ok() uses #2. This patch changes the return type from int to bool so that false means failure and true means success. This eliminates a potential source of errors. Suggested-by: Linus Torvalds <[email protected]> Signed-off-by: Stefan Hajnoczi <[email protected]> Acked-by: Michael S. Tsirkin <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-11vhost: fix vhost_vq_access_ok() log checkStefan Hajnoczi1-3/+5
Commit d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log when IOTLB is enabled") introduced a regression. The logic was originally: if (vq->iotlb) return 1; return A && B; After the patch the short-circuit logic for A was inverted: if (A || vq->iotlb) return A; return B; This patch fixes the regression by rewriting the checks in the obvious way, no longer returning A when vq->iotlb is non-NULL (which is hard to understand). Reported-by: [email protected] Cc: Jason Wang <[email protected]> Signed-off-by: Stefan Hajnoczi <[email protected]> Acked-by: Michael S. Tsirkin <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-11vhost: Fix vhost_copy_to_user()Eric Auger1-1/+1
vhost_copy_to_user is used to copy vring used elements to userspace. We should use VHOST_ADDR_USED instead of VHOST_ADDR_DESC. Fixes: f88949138058 ("vhost: introduce O(1) vq metadata cache") Signed-off-by: Eric Auger <[email protected]> Acked-by: Jason Wang <[email protected]> Acked-by: Michael S. Tsirkin <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-11Merge branch 'Aquantia-atlantic-critical-fixes-04-2018'David S. Miller2-3/+21
Igor Russkikh says: ==================== Aquantia atlantic critical fixes 04/2018 Two regressions on latest 4.16 driver reported by users Some of old FW (1.5.44) had a link management logic which prevents driver to make clean reset. Driver of 4.16 has a full hardware reset implemented and that broke the link and traffic on such a cards. Second is oops on shutdown callback in case interface is already closed or was never opened. ==================== Signed-off-by: David S. Miller <[email protected]>