aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2018-08-22mm: /proc/pid/smaps: factor out common stats printingVlastimil Babka1-22/+29
To prepare for handling /proc/pid/smaps_rollup differently from /proc/pid/smaps factor out from show_smap() printing the parts of output that are common for both variants, which is the bulk of the gathered memory stats. [[email protected]: add const, per Alexey] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Reviewed-by: Alexey Dobriyan <[email protected]> Cc: Daniel Colascione <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22mm: /proc/pid/smaps: factor out mem stats gatheringVlastimil Babka1-24/+31
To prepare for handling /proc/pid/smaps_rollup differently from /proc/pid/smaps factor out vma mem stats gathering from show_smap() - it will be used by both. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Reviewed-by: Alexey Dobriyan <[email protected]> Cc: Daniel Colascione <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22mm: /proc/pid/*maps remove is_pid and related wrappersVlastimil Babka4-144/+18
Patch series "cleanups and refactor of /proc/pid/smaps*". The recent regression in /proc/pid/smaps made me look more into the code. Especially the issues with smaps_rollup reported in [1] as explained in Patch 4, which fixes them by refactoring the code. Patches 2 and 3 are preparations for that. Patch 1 is me realizing that there's a lot of boilerplate left from times where we tried (unsuccessfuly) to mark thread stacks in the output. Originally I had also plans to rework the translation from /proc/pid/*maps* file offsets to the internal structures. Now the offset means "vma number", which is not really stable (vma's can come and go between read() calls) and there's an extra caching of last vma's address. My idea was that offsets would be interpreted directly as addresses, which would also allow meaningful seeks (see the ugly seek_to_smaps_entry() in tools/testing/selftests/vm/mlock2.h). However loff_t is (signed) long long so that might be insufficient somewhere for the unsigned long addresses. So the result is fixed issues with skewed /proc/pid/smaps_rollup results, simpler smaps code, and a lot of unused code removed. [1] https://marc.info/?l=linux-mm&m=151927723128134&w=2 This patch (of 4): Commit b76437579d13 ("procfs: mark thread stack correctly in proc/<pid>/maps") introduced differences between /proc/PID/maps and /proc/PID/task/TID/maps to mark thread stacks properly, and this was also done for smaps and numa_maps. However it didn't work properly and was ultimately removed by commit b18cb64ead40 ("fs/proc: Stop trying to report thread stacks"). Now the is_pid parameter for the related show_*() functions is unused and we can remove it together with wrapper functions and ops structures that differ for PID and TID cases only in this parameter. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Reviewed-by: Alexey Dobriyan <[email protected]> Cc: Daniel Colascione <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22mm/oom_kill.c: clean up oom_reap_task_mm()Michal Hocko1-8/+16
Andrew has noticed some inconsistencies in oom_reap_task_mm. Notably - Undocumented return value. - comment "failed to reap part..." is misleading - sounds like it's referring to something which happened in the past, is in fact referring to something which might happen in the future. - fails to call trace_finish_task_reaping() in one case - code duplication. - Increases mmap_sem hold time a little by moving trace_finish_task_reaping() inside the locked region. So sue me ;) - Sharing the finish: path means that the trace event won't distinguish between the two sources of finishing. Add a short explanation for the return value and fix the rest by reorganizing the function a bit to have unified function exit paths. Link: http://lkml.kernel.org/r/[email protected] Suggested-by: Andrew Morton <[email protected]> Signed-off-by: Michal Hocko <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22mm, oom: describe task memory unit, larger PID padRodrigo Freire1-2/+3
The default page memory unit of OOM task dump events might not be intuitive and potentially misleading for the non-initiated when debugging OOM events: These are pages and not kBs. Add a small printk prior to the task dump informing that the memory units are actually memory _pages_. Also extends PID field to align on up to 7 characters. Reference https://lkml.org/lkml/2018/7/3/1201 Link: http://lkml.kernel.org/r/c795eb5129149ed8a6345c273aba167ff1bbd388.1530715938.git.rfreire@redhat.com Signed-off-by: Rodrigo Freire <[email protected]> Acked-by: David Rientjes <[email protected]> Acked-by: Rafael Aquini <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Tetsuo Handa <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22mm, oom: remove oom_lock from oom_reaperMichal Hocko2-28/+4
oom_reaper used to rely on the oom_lock since e2fe14564d33 ("oom_reaper: close race with exiting task"). We do not really need the lock anymore though. 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently") has removed serialization with the exit path based on the mm reference count and so we do not really rely on the oom_lock anymore. Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock to prevent from races when the page allocator didn't manage to get the freed (reaped) memory in __alloc_pages_may_oom but it sees the flag later on and move on to another victim. Although this is possible in principle let's wait for it to actually happen in real life before we make the locking more complex again. Therefore remove the oom_lock for oom_reaper paths (both exit_mmap and oom_reap_task_mm). The reaper serializes with exit_mmap by mmap_sem + MMF_OOM_SKIP flag. There is no synchronization with out_of_memory path now. [[email protected]: oom_reap_task_mm should return false when __oom_reap_task_mm did] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Suggested-by: David Rientjes <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Tetsuo Handa <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22mm, oom: distinguish blockable mode for mmu notifiersMichal Hocko19-80/+223
There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover majority of notifiers only care about a portion of the address space and there is absolutely zero reason to fail when we are unmapping an unrelated range. Many notifiers do really block and wait for HW which is harder to handle and we have to bail out though. This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks and continue as long as we do not block down the call chain. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock in most cases AFAICS. The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. The simplest way for driver developers to test this code path is to wrap userspace code which uses these notifiers into a memcg and set the hard limit to hit the oom. This can be done e.g. after the test faults in all the mmu notifier managed memory and set the hard limit to something really small. Then we are looking for a proper process tear down. [[email protected]: coding style fixes] [[email protected]: minor code simplification] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Christian König <[email protected]> # AMD notifiers Acked-by: Leon Romanovsky <[email protected]> # mlx and umem_odp Reported-by: David Rientjes <[email protected]> Cc: "David (ChunMing) Zhou" <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Alex Deucher <[email protected]> Cc: David Airlie <[email protected]> Cc: Jani Nikula <[email protected]> Cc: Joonas Lahtinen <[email protected]> Cc: Rodrigo Vivi <[email protected]> Cc: Doug Ledford <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Mike Marciniszyn <[email protected]> Cc: Dennis Dalessandro <[email protected]> Cc: Sudeep Dutt <[email protected]> Cc: Ashutosh Dixit <[email protected]> Cc: Dimitri Sivanich <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Juergen Gross <[email protected]> Cc: "Jérôme Glisse" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Felix Kuehling <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22mm/swapfile.c: put_swap_page: share more between huge/normal code pathHuang Ying1-10/+10
In this patch, locking related code is shared between huge/normal code path in put_swap_page() to reduce code duplication. The `free_entries == 0` case is merged into the more general `free_entries != SWAPFILE_CLUSTER` case, because the new locking method makes it easy. The added lines is same as the removed lines. But the code size is increased when CONFIG_TRANSPARENT_HUGEPAGE=n. text data bss dec hex filename base: 24123 2004 340 26467 6763 mm/swapfile.o unified: 24485 2004 340 26829 68cd mm/swapfile.o Dig on step deeper with `size -A mm/swapfile.o` for base and unified kernel and compare the result, yields, -.text 17723 0 +.text 17835 0 -.orc_unwind_ip 1380 0 +.orc_unwind_ip 1480 0 -.orc_unwind 2070 0 +.orc_unwind 2220 0 -Total 26686 +Total 27048 The total difference is the same. The text segment difference is much smaller: 112. More difference comes from the ORC unwinder segments: (1480 + 2220) - (1380 + 2070) = 250. If the frame pointer unwinder is used, this costs nothing. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Reviewed-by: Daniel Jordan <[email protected]> Acked-by: Dave Hansen <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Dan Williams <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22mm/swapfile.c: add __swap_entry_free_locked()Huang Ying1-6/+14
The part of __swap_entry_free() with lock held is separated into a new function __swap_entry_free_locked(). Because we want to reuse that piece of code in some other places. Just mechanical code refactoring, there is no any functional change in this function. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Reviewed-by: Daniel Jordan <[email protected]> Acked-by: Dave Hansen <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Dan Williams <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22mm, swap, get_swap_pages: use entry_size instead of cluster in parameterHuang Ying3-13/+13
As suggested by Matthew Wilcox, it is better to use "int entry_size" instead of "bool cluster" as parameter to specify whether to operate for huge or normal swap entries. Because this improve the flexibility to support other swap entry size. And Dave Hansen thinks that this improves code readability too. So in this patch, the "bool cluster" parameter of get_swap_pages() is replaced by "int entry_size". And nr_swap_entries() trick is used to reduce the binary size when !CONFIG_TRANSPARENT_HUGE_PAGE. text data bss dec hex filename base 24215 2028 340 26583 67d7 mm/swapfile.o head 24123 2004 340 26467 6763 mm/swapfile.o Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Suggested-by: Matthew Wilcox <[email protected]> Acked-by: Dave Hansen <[email protected]> Cc: Daniel Jordan <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Dan Williams <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22mm/swapfile.c: unify normal/huge code path in put_swap_page()Huang Ying1-46/+37
In this patch, the normal/huge code path in put_swap_page() and several helper functions are unified to avoid duplicated code, bugs, etc. and make it easier to review the code. The removed lines are more than added lines. And the binary size is kept exactly same when CONFIG_TRANSPARENT_HUGEPAGE=n. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Suggested-by: Dave Hansen <[email protected]> Acked-by: Dave Hansen <[email protected]> Reviewed-by: Daniel Jordan <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Dan Williams <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22mm/swapfile.c: unify normal/huge code path in swap_page_trans_huge_swapped()Huang Ying1-4/+3
As suggested by Dave, we should unify the code path for normal and huge swap support if possible to avoid duplicated code, bugs, etc. and make it easier to review code. In this patch, the normal/huge code path in swap_page_trans_huge_swapped() is unified, the added and removed lines are same. And the binary size is kept almost same when CONFIG_TRANSPARENT_HUGEPAGE=n. text data bss dec hex filename base: 24179 2028 340 26547 67b3 mm/swapfile.o unified: 24215 2028 340 26583 67d7 mm/swapfile.o Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Suggested-and-acked-by: Dave Hansen <[email protected]> Reviewed-by: Daniel Jordan <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Dan Williams <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22mm/swapfile.c: use swap_count() in swap_page_trans_huge_swapped()Huang Ying1-2/+2
In swap_page_trans_huge_swapped(), to identify whether there's any page table mapping for a 4k sized swap entry, "si->swap_map[i] != SWAP_HAS_CACHE" is used. This works correctly now, because all users of the function will only call it after checking SWAP_HAS_CACHE. But as pointed out by Daniel, it is better to use "swap_count(map[i])" here, because it works for "map[i] == 0" case too. And this makes the implementation more consistent between normal and huge swap entry. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Suggested-and-reviewed-by: Daniel Jordan <[email protected]> Acked-by: Dave Hansen <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Dan Williams <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22mm/swapfile.c: replace some #ifdef with IS_ENABLED()Huang Ying1-40/+20
In mm/swapfile.c, THP (Transparent Huge Page) swap specific code is enclosed by #ifdef CONFIG_THP_SWAP/#endif to avoid code dilating when THP isn't enabled. But #ifdef/#endif in .c file hurt the code readability, so Dave suggested to use IS_ENABLED(CONFIG_THP_SWAP) instead and let compiler to do the dirty job for us. This has potential to remove some duplicated code too. From output of `size`, text data bss dec hex filename THP=y: 26269 2076 340 28685 700d mm/swapfile.o ifdef/endif: 24115 2028 340 26483 6773 mm/swapfile.o IS_ENABLED: 24179 2028 340 26547 67b3 mm/swapfile.o IS_ENABLED() based solution works quite well, almost as good as that of #ifdef/#endif. And from the diffstat, the removed lines are more than added lines. One #ifdef for split_swap_cluster() is kept. Because it is a public function with a stub implementation for CONFIG_THP_SWAP=n in swap.h. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Suggested-and-acked-by: Dave Hansen <[email protected]> Reviewed-by: Daniel Jordan <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Dan Williams <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22mm: swap: add comments to lock_cluster_or_swap_info()Huang Ying1-2/+7
Patch series "swap: THP optimizing refactoring", v4. Now the THP (Transparent Huge Page) swap optimizing is implemented in the way like below, #ifdef CONFIG_THP_SWAP huge_function(...) { } #else normal_function(...) { } #endif general_function(...) { if (huge) return thp_function(...); else return normal_function(...); } As pointed out by Dave Hansen, this will, 1. Create a new, wholly untested code path for huge page 2. Create two places to patch bugs 3. Are not reusing code when possible This patchset is to address these problems via merging huge/normal code path/functions if possible. One concern is that this may cause code size to dilate when !CONFIG_TRANSPARENT_HUGEPAGE. The data shows that most refactoring will only cause quite slight code size increase. This patch (of 8): To improve code readability. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Suggested-and-acked-by: Dave Hansen <[email protected]> Reviewed-by: Daniel Jordan <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Dan Williams <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22mm: struct shrinker: make flags of unsigned typeKirill Tkhai1-2/+2
Currently, there are two flags only, so unsigned is more then enough. Also, move int seeks to keep these fields together. Link: http://lkml.kernel.org/r/153199748720.21131.6476256940113102483.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: Chris Wilson <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22mm: struct shrink_control: keep int fields togetherKirill Tkhai1-3/+3
Patch series "Reorderings in struct shrinker and struct shrink_control". These structures are intensively used during reclaim and, displace other data in cache, so there is no a reason they have int fields not grouped together. This patch (of 2): gfp_t is of unsigned type, so let's move nid to keep them together. Link: http://lkml.kernel.org/r/153199747930.21131.861043607301997810.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: Chris Wilson <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22mm: check shrinker is memcg-aware in register_shrinker_prepared()Kirill Tkhai1-1/+2
There is a sad BUG introduced in patch adding SHRINKER_REGISTERING. shrinker_idr business is only for memcg-aware shrinkers. Only such type of shrinkers have id and they must be finaly installed via idr_replace() in this function. For !memcg-aware shrinkers we never initialize shrinker->id field. But there are all types of shrinkers passed to idr_replace(), and every !memcg-aware shrinker with random ID (most probably, its id is 0) replaces memcg-aware shrinker pointed by the ID in IDR. This patch fixes the problem. Link: http://lkml.kernel.org/r/[email protected] Fixes: 7e010df53c80 "mm: use special value SHRINKER_REGISTERING instead of list_empty() check" Signed-off-by: Kirill Tkhai <[email protected]> Reported-by: <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: <[email protected]> Cc: Huang Ying <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22autofs: fix autofs_sbi() does not check super block typeIan Kent2-2/+3
autofs_sbi() does not check the superblock magic number to verify it has been given an autofs super block. Link: http://lkml.kernel.org/r/[email protected] Reported-by: <[email protected]> Signed-off-by: Ian Kent <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22workqueue: re-add lockdep dependencies for flushingJohannes Berg1-0/+8
In flush_work(), we need to create a lockdep dependency so that the following scenario is appropriately tagged as a problem: work_function() { mutex_lock(&mutex); ... } other_function() { mutex_lock(&mutex); flush_work(&work); // or cancel_work_sync(&work); } This is a problem since the work might be running and be blocked on trying to acquire the mutex. Similarly, in flush_workqueue(). These were removed after cross-release partially caught these problems, but now cross-release was reverted anyway. IMHO the removal was erroneous anyway though, since lockdep should be able to catch potential problems, not just actual ones, and cross-release would only have caught the problem when actually invoking wait_for_completion(). Fixes: fd1a5b04dfb8 ("workqueue: Remove now redundant lock acquisitions wrt. workqueue flushes") Signed-off-by: Johannes Berg <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2018-08-22workqueue: skip lockdep wq dependency in cancel_work_sync()Johannes Berg1-15/+22
In cancel_work_sync(), we can only have one of two cases, even with an ordered workqueue: * the work isn't running, just cancelled before it started * the work is running, but then nothing else can be on the workqueue before it Thus, we need to skip the lockdep workqueue dependency handling, otherwise we get false positive reports from lockdep saying that we have a potential deadlock when the workqueue also has other work items with locking, e.g. work1_function() { mutex_lock(&mutex); ... } work2_function() { /* nothing */ } other_function() { queue_work(ordered_wq, &work1); queue_work(ordered_wq, &work2); mutex_lock(&mutex); cancel_work_sync(&work2); } As described above, this isn't a problem, but lockdep will currently flag it as if cancel_work_sync() was flush_work(), which *is* a problem. Signed-off-by: Johannes Berg <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2018-08-22ata: ahci_platform: enable to get and control resetKunihiko Hayashi1-1/+2
Unlike SoC-specific driver, generic ahci_platform driver doesn't have any chances to control resets. This adds AHCI_PLATFORM_GET_RESETS to ahci_platform_get_resources() on the generic driver to enable reset control support. Suggested-by: Hans de Goede <[email protected]> Cc: Thierry Reding <[email protected]> Signed-off-by: Kunihiko Hayashi <[email protected]> Reviewed-by: Hans de Goede <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2018-08-22ata: libahci_platform: add reset control supportKunihiko Hayashi4-5/+30
Add support to get and control a list of resets for the device as optional and shared. These resets must be kept de-asserted until the device is enabled. This is specified as shared because some SoCs like UniPhier series have common reset controls with all ahci controller instances. However, according to Thierry's view, https://www.spinics.net/lists/linux-ide/msg55357.html some hardware-specific drivers already use their own resets, and the common reset make a path to occur double controls of resets. The ahci_platform_get_resources() can get and control the reset only when the second argument includes AHCI_PLATFORM_GET_RESETS bit. Suggested-by: Hans de Goede <[email protected]> Cc: Thierry Reding <[email protected]> Signed-off-by: Kunihiko Hayashi <[email protected]> Reviewed-by: Hans de Goede <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2018-08-22ata: add an extra argument to ahci_platform_get_resources()Kunihiko Hayashi16-16/+18
Add an extra argument to ahci_platform_get_resources(), that is for the bitmap representing the resource to get in this function. Currently there is no resources to be defined, so all the callers set '0' to the argument. Suggested-by: Hans de Goede <[email protected]> Cc: Thierry Reding <[email protected]> Cc: Matthias Brugger <[email protected]> Cc: Patrice Chotard <[email protected]> Cc: Maxime Ripard <[email protected]> Signed-off-by: Kunihiko Hayashi <[email protected]> Reviewed-by: Hans de Goede <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2018-08-22KVM: VMX: fixes for vmentry_l1d_flush module parameterPaolo Bonzini1-10/+16
Two bug fixes: 1) missing entries in the l1d_param array; this can cause a host crash if an access attempts to reach the missing entry. Future-proof the get function against any overflows as well. However, the two entries VMENTER_L1D_FLUSH_EPT_DISABLED and VMENTER_L1D_FLUSH_NOT_REQUIRED must not be accepted by the parse function, so disable them there. 2) invalid values must be rejected even if the CPU does not have the bug, so test for them before checking boot_cpu_has(X86_BUG_L1TF) ... and a small refactoring, since the .cmd field is redundant with the index in the array. Reported-by: Bandan Das <[email protected]> Cc: [email protected] Fixes: a7b9020b06ec6d7c3f3b0d4ef1a9eba12654f4f7 Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-22kvm: selftest: add dirty logging testPeter Xu4-0/+356
Test KVM dirty logging functionality. The test creates a standalone memory slot to test tracking the dirty pages since we can't really write to the default memory slot which still contains the guest ELF image. We have two threads running during the test: (1) the vcpu thread continuously dirties random guest pages by writting a iteration number to the first 8 bytes of the page (2) the host thread continuously fetches dirty logs for the testing memory region and verify each single bit of the dirty bitmap by checking against the values written onto the page Note that since the guest cannot calls the general userspace APIs like random(), it depends on the host to provide random numbers for the page indexes to dirty. Signed-off-by: Peter Xu <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-22kvm: selftest: pass in extra memory when create vmPeter Xu7-8/+23
This information can be used to decide the size of the default memory slot, which will need to cover the extra pages with page tables. Signed-off-by: Peter Xu <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-22kvm: selftest: include the tools headersPeter Xu3-3/+3
Let the kvm selftest include the tools headers, then we can start to use things there like bitmap operations. Signed-off-by: Peter Xu <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-22kvm: selftest: unify the guest port macrosPeter Xu6-95/+78
Most of the tests are using the same way to do guest to host sync but the code is mostly duplicated. Generalize the guest port macros into the common header file and use it in different tests. Meanwhile provide "struct guest_args" and a helper "guest_args_read()" to hide the register details when playing with these port operations on RDI and RSI. Signed-off-by: Peter Xu <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-22tools: introduce test_and_clear_bitPeter Xu1-0/+17
We have test_and_set_bit but not test_and_clear_bit. Add it. Signed-off-by: Peter Xu <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-22KVM: x86: SVM: Call x86_spec_ctrl_set_guest/host() with interrupts disabledThomas Gleixner1-4/+4
Mikhail reported the following lockdep splat: WARNING: possible irq lock inversion dependency detected CPU 0/KVM/10284 just changed the state of lock: 000000000d538a88 (&st->lock){+...}, at: speculative_store_bypass_update+0x10b/0x170 but this lock was taken by another, HARDIRQ-safe lock in the past: (&(&sighand->siglock)->rlock){-.-.} and interrupts could create inverse lock ordering between them. Possible interrupt unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&st->lock); local_irq_disable(); lock(&(&sighand->siglock)->rlock); lock(&st->lock); <Interrupt> lock(&(&sighand->siglock)->rlock); *** DEADLOCK *** The code path which connects those locks is: speculative_store_bypass_update() ssb_prctl_set() do_seccomp() do_syscall_64() In svm_vcpu_run() speculative_store_bypass_update() is called with interupts enabled via x86_virt_spec_ctrl_set_guest/host(). This is actually a false positive, because GIF=0 so interrupts are disabled even if IF=1; however, we can easily move the invocations of x86_virt_spec_ctrl_set_guest/host() into the interrupt disabled region to cure it, and it's a good idea to keep the GIF=0/IF=1 area as small and self-contained as possible. Fixes: 1f50ddb4f418 ("x86/speculation: Handle HT correctly on AMD") Reported-by: Mikhail Gavrilov <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Mikhail Gavrilov <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Radim Krčmář <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Tom Lendacky <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-22KVM: vmx: Inject #UD for SGX ENCLS instruction in guestSean Christopherson1-1/+29
Virtualization of Intel SGX depends on Enclave Page Cache (EPC) management that is not yet available in the kernel, i.e. KVM support for exposing SGX to a guest cannot be added until basic support for SGX is upstreamed, which is a WIP[1]. Until SGX is properly supported in KVM, ensure a guest sees expected behavior for ENCLS, i.e. all ENCLS #UD. Because SGX does not have a true software enable bit, e.g. there is no CR4.SGXE bit, the ENCLS instruction can be executed[1] by the guest if SGX is supported by the system. Intercept all ENCLS leafs (via the ENCLS- exiting control and field) and unconditionally inject #UD. [1] https://www.spinics.net/lists/kvm/msg171333.html or https://lkml.org/lkml/2018/7/3/879 [2] A guest can execute ENCLS in the sense that ENCLS will not take an immediate #UD, but no ENCLS will ever succeed in a guest without explicit support from KVM (map EPC memory into the guest), unless KVM has a *very* egregious bug, e.g. accidentally mapped EPC memory into the guest SPTEs. In other words this patch is needed only to prevent the guest from seeing inconsistent behavior, e.g. #GP (SGX not enabled in Feature Control MSR) or #PF (leaf operand(s) does not point at EPC memory) instead of #UD on ENCLS. Intercepting ENCLS is not required to prevent the guest from truly utilizing SGX. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-22KVM: vmx: Add defines for SGX ENCLS exitingSean Christopherson1-0/+3
Hardware support for basic SGX virtualization adds a new execution control (ENCLS_EXITING), VMCS field (ENCLS_EXITING_BITMAP) and exit reason (ENCLS), that enables a VMM to intercept specific ENCLS leaf functions, e.g. to inject faults when the VMM isn't exposing SGX to a VM. When ENCLS_EXITING is enabled, the VMM can set/clear bits in the bitmap to intercept/allow ENCLS leaf functions in non-root, e.g. setting bit 2 in the ENCLS_EXITING_BITMAP will cause ENCLS[EINIT] to VMExit(ENCLS). Note: EXIT_REASON_ENCLS was previously added by commit 1f5199927034 ("KVM: VMX: add missing exit reasons"). Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-22x86/kvm/vmx: Fix coding style in vmx_setup_l1d_flush()Yi Wang1-9/+9
Substitute spaces with tab. No functional changes. Signed-off-by: Yi Wang <[email protected]> Reviewed-by: Jiang Biao <[email protected]> Message-Id: <[email protected]> Cc: [email protected] # L1TF Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-22x86: kvm: avoid unused variable warningArnd Bergmann1-3/+1
Removing one of the two accesses of the maxphyaddr variable led to a harmless warning: arch/x86/kvm/x86.c: In function 'kvm_set_mmio_spte_mask': arch/x86/kvm/x86.c:6563:6: error: unused variable 'maxphyaddr' [-Werror=unused-variable] Removing the #ifdef seems to be the nicest workaround, as it makes the code look cleaner than adding another #ifdef. Fixes: 28a1f3ac1d0c ("kvm: x86: Set highest physical address bits in non-present/reserved SPTEs") Signed-off-by: Arnd Bergmann <[email protected]> Cc: [email protected] # L1TF Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-22Merge tag 'acpi-4.19-rc1-2' of ↵Linus Torvalds19-56/+260
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more ACPI updates from Rafael Wysocki: "These update the ACPICA code in the kernel to the most recent upstream revision (which includes a regression fix and other improvements), make ACPICA clear the status of all ACPI events when entering sleep states (to restore the previous behavior) and update the ACPI operation region driver for the CrystalCove PMIC. Specifics: - Update the ACPICA code in the kernel to upstream revision 20180810 including: * Fix for AML parser regression causing it to mishandle opcodes that open a scope upon parse failures (Erik Schmauss) * Fix for a reference counting issue on large systems (Erik Schmauss) * Fix to discard values coming from register reads that have failed (Erik Schmauss) * Two acpiexec fixes (Bob Moore, Erik Schmauss) * Debugger cleanup (Bob Moore) * Cleanup of duplicate table error message (Bob Moore) * Cleanup of hex detection in the utilities (Erik Schmauss) - Make ACPICA clear the status of all ACPI events when entering sleep states again to avoid functional regressions (Rafael Wysocki) - Update the ACPI operation region driver for the CrystalCove PMIC to cover all of the known operation region fields (Hans de Goede)" * tag 'acpi-4.19-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI / PMIC: CrystalCove: Extend PMOP support to support all possible fields ACPICA: Clear status of all events when entering sleep states ACPICA: Update version to 20180810 ACPICA: acpiexec: fix a small memory leak regression ACPICA: Reference Counts: increase max to 0x4000 for large servers ACPICA: Reference count: add additional debugging details ACPICA: acpi_exec: fixing -fi option ACPICA: Debugger: Cleanup interface to the AML disassembler ACPICA: AML Parser: skip opcodes that open a scope upon parse failure ACPICA: Utilities: split hex detection into smaller functions ACPICA: Update an error message for a duplicate table ACPICA: ACPICA: add status check for acpi_hw_read before assigning return value ACPICA: AML Parser: ignore all exceptions resulting from incorrect AML during table load
2018-08-22Merge tag 'pm-4.19-rc1-2' of ↵Linus Torvalds6-25/+43
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more power management updates from Rafael Wysocki: "These fix the main idle loop and the menu cpuidle governor, clean up the latter, fix a mistake in the PCI bus type's support for system suspend and resume, fix the ondemand and conservative cpufreq governors, address a build issue in the system wakeup framework and make the ACPI C-states desciptions less confusing. Specifics: - Make the idle loop handle stopped scheduler tick correctly (Rafael Wysocki). - Prevent the menu cpuidle governor from letting CPUs spend too much time in shallow idle states when it is invoked with scheduler tick stopped and clean it up somewhat (Rafael Wysocki). - Avoid invoking the platform firmware to make the platform enter the ACPI S3 sleep state with suspended PCIe root ports which may confuse the firmware and cause it to crash (Rafael Wysocki). - Fix sysfs-related race in the ondemand and conservative cpufreq governors which may cause the system to crash if the governor module is removed during an update of CPU frequency limits (Henry Willard). - Select SRCU when building the system wakeup framework to avoid a build issue in it (zhangyi). - Make the descriptions of ACPI C-states vendor-neutral to avoid confusion (Prarit Bhargava)" * tag 'pm-4.19-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpuidle: menu: Handle stopped tick more aggressively sched: idle: Avoid retaining the tick when it has been stopped PCI / ACPI / PM: Resume all bridges on suspend-to-RAM cpuidle: menu: Update stale polling override comment cpufreq: governor: Avoid accessing invalid governor_data x86/ACPI/cstate: Make APCI C1 FFH MWAIT C-state description vendor-neutral cpuidle: menu: Fix white space PM / sleep: wakeup: Fix build error caused by missing SRCU support
2018-08-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/ideLinus Torvalds7-9/+8
Pull IDE updates from David Miller: - Remove redundant variables (Colin Ian King) - Expected switch fall-through annotations (Gustavo A. R. Silva) * git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide: ide: mark expected switch fall-throughs ide-tape: remove redundant variable buffer_size ide: remove redundant variables queue_run_ms and left
2018-08-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparcLinus Torvalds5-249/+88
Pull sparc updates from David Miller: "Nothing super serious: - Convert sparc32 over to NO_BOOTMEM (Mike Rapoport) - Use dma_noncoherent_ops on sparc32 (Christoph Hellwig) - Fix kbuild defconfig handling on sparc32 (Masahiro Yamada)" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc: sparc: fix KBUILD_DEFCONFIG for ARCH=sparc32 sparc32: split ramdisk detection and reservation to a helper function sparc32: switch to NO_BOOTMEM sparc: mm/init_32: kill trailing whitespace sparc: use generic dma_noncoherent_ops
2018-08-22initramfs: move gen_initramfs_list.sh from scripts/ to usr/Masahiro Yamada4-5/+5
scripts/gen_initramfs_list.sh is only invoked from usr/Makefile. Move it so that all tools to create initramfs are self-contained in the usr/ directory. Signed-off-by: Masahiro Yamada <[email protected]>
2018-08-22vmlinux.lds.h: remove stale <linux/export.h> includeMasahiro Yamada1-2/+0
This is unneeded since commit a62143850053 ("vmlinux.lds.h: remove no-op macro VMLINUX_SYMBOL()"). Signed-off-by: Masahiro Yamada <[email protected]>
2018-08-22export.h: remove VMLINUX_SYMBOL() and VMLINUX_SYMBOL_STR()Masahiro Yamada3-17/+10
With the special case handling for Blackfin and Metag was removed by commit 94e58e0ac312 ("export.h: remove code for prefixing symbols with underscore"), VMLINUX_SYMBOL() is no-op. Replace the remaining usages, then remove the definition of VMLINUX_SYMBOL() and VMLINUX_SYMBOL_STR(). Signed-off-by: Masahiro Yamada <[email protected]>
2018-08-22Coccinelle: remove pci_alloc_consistent semantic to detect in ↵zhong jiang1-40/+1
zalloc-simple.cocci Because pci_alloc_consistent has been deprecated. We prefer to use dma_alloc_coherent directly. Therefore, we should remove pci_alloc_consistent to increase the confidence. Acked-by: Julia Lawall <[email protected]> Acked-by: Himanshu Jha <[email protected]> Signed-off-by: zhong jiang <[email protected]> Signed-off-by: Masahiro Yamada <[email protected]>
2018-08-22kbuild: make sorting initramfs contents independent of localeAndrzej Pietrasiewicz1-1/+1
Some LANG values (e.g. pl_PL.UTF-8) cause the sort command to output files before their parent directories, which makes them inaccessible for the kernel. In other words, when the kernel populates the rootfs, it is unable to create files whose parent directories have not been yet created. This patch makes sorting use the default (LANG=C) locale, which results in correctly laid out initramfs images (parent directories before files). Signed-off-by: Andrzej Pietrasiewicz <[email protected]> Signed-off-by: Masahiro Yamada <[email protected]>
2018-08-22kbuild: remove "rpm" target, which is alias of "rpm-pkg"Masahiro Yamada1-4/+0
As commit ebaad7d36406 ("kbuild: rpm: prompt to use "rpm-pkg" if "rpm" target is used") noticed, the "rpm" target is now removed. I assume people have already migrated to "rpm-pkg". Signed-off-by: Masahiro Yamada <[email protected]>
2018-08-22kbuild: Fix LOADLIBES rename in Documentation/kbuild/makefiles.txtMichal Suchanek1-1/+1
Fixes: 8377bd2b9ee1 ("kbuild: Rename HOST_LOADLIBES to KBUILD_HOSTLDLIBS") Signed-off-by: Michal Suchanek <[email protected]> Acked-by: Laura Abbott <[email protected]> Signed-off-by: Masahiro Yamada <[email protected]>
2018-08-22kconfig: suppress "configuration written to .config" for syncconfigMasahiro Yamada1-0/+5
The top-level Makefile invokes "make syncconfig" when necessary. Then, Kconfig displays the following message when .config is updated. # # configuration written to .config # It is distracting because "make syncconfig" happens during the build stage, and does nothing important in most cases. Suggested-by: Linus Torvalds <[email protected]> Signed-off-by: Masahiro Yamada <[email protected]>
2018-08-22kconfig: fix "Can't open ..." in parallel buildMasahiro Yamada1-2/+3
If you run "make menuconfig" or "make nconfig" with -j<N> option in a fresh source tree, you will see several "Can't open ..." messages: $ make -j8 menuconfig HOSTCC scripts/basic/fixdep YACC scripts/kconfig/zconf.tab.c LEX scripts/kconfig/zconf.lex.c /bin/sh: 1: .: Can't open scripts/kconfig/.mconf-cfg /bin/sh: 1: .: Can't open scripts/kconfig/.mconf-cfg /bin/sh: 1: .: Can't open scripts/kconfig/.mconf-cfg /bin/sh: 1: .: HOSTCC scripts/kconfig/lxdialog/checklist.o Can't open scripts/kconfig/.mconf-cfg /bin/sh: 1: .: Can't open scripts/kconfig/.mconf-cfg /bin/sh: 1: .: Can't open scripts/kconfig/.mconf-cfg /bin/sh: 1: .: Can't open scripts/kconfig/.mconf-cfg HOSTCC scripts/kconfig/lxdialog/inputbox.o /bin/sh: 1: .: Can't open scripts/kconfig/.mconf-cfg /bin/sh: 1: .: Can't open scripts/kconfig/.mconf-cfg /bin/sh: 1: .: Can't open scripts/kconfig/.mconf-cfg UPD scripts/kconfig/.mconf-cfg /bin/sh: 1: .: Can't open scripts/kconfig/.mconf-cfg HOSTCC scripts/kconfig/lxdialog/menubox.o HOSTCC scripts/kconfig/lxdialog/textbox.o HOSTCC scripts/kconfig/lxdialog/util.o HOSTCC scripts/kconfig/lxdialog/yesno.o HOSTCC scripts/kconfig/mconf.o HOSTCC scripts/kconfig/zconf.tab.o HOSTLD scripts/kconfig/mconf Correct dependencies to fix this problem. Fixes: 1c5af5cf9308 ("kconfig: refactor ncurses package checks for building mconf and nconf") Cc: linux-stable <[email protected]> # v4.18 Reported-by: Borislav Petkov <[email protected]> Signed-off-by: Masahiro Yamada <[email protected]> Tested-by: Borislav Petkov <[email protected]>
2018-08-22kbuild: Add a space after `!` to prevent parsing as file patternMichael Forney1-1/+1
Some shells use !(pattern|...|pattern) to match file names not containing the specified patterns. This may result in output like $ ./scripts/clang-version.sh gcc ./scripts/clang-version.sh[18]: COPYING: not found printf: %d __clang_major__: conversion error printf: %d __clang_minor__: conversion error printf: %d __clang_patchlevel__: conversion error 00000 $ and set CONFIG_CLANG_VERSION to the invalid value '00000'. POSIX says[0] If the pipeline begins with the reserved word ! and command1 is a subshell command, the application shall ensure that the ( operator at the beginning of command1 is separated from the ! by one or more <blank> characters. The behavior of the reserved word ! immediately followed by the ( operator is unspecified. So, just add a <blank> to prevent this. [0] http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_09_02 Signed-off-by: Michael Forney <[email protected]> Signed-off-by: Masahiro Yamada <[email protected]>
2018-08-22scripts: modpost: check memory allocation resultsRandy Dunlap1-4/+4
Fix missing error check for memory allocation functions in scripts/mod/modpost.c. Fixes kernel bugzilla #200319: https://bugzilla.kernel.org/show_bug.cgi?id=200319 Signed-off-by: Randy Dunlap <[email protected]> Cc: Yuexing Wang <[email protected]> Signed-off-by: Masahiro Yamada <[email protected]>