aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2019-05-14mm: maintain randomization of page free listsDan Williams4-2/+56
When freeing a page with an order >= shuffle_page_order randomly select the front or back of the list for insertion. While the mm tries to defragment physical pages into huge pages this can tend to make the page allocator more predictable over time. Inject the front-back randomness to preserve the initial randomness established by shuffle_free_memory() when the kernel was booted. The overhead of this manipulation is constrained by only being applied for MAX_ORDER sized pages by default. [[email protected]: coding-style fixes] Link: http://lkml.kernel.org/r/154899812788.3165233.9066631950746578517.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <[email protected]> Reviewed-by: Kees Cook <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Keith Busch <[email protected]> Cc: Robert Elliott <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14mm: move buddy list manipulations into helpersDan Williams5-48/+73
In preparation for runtime randomization of the zone lists, take all (well, most of) the list_*() functions in the buddy allocator and put them in helper functions. Provide a common control point for injecting additional behavior when freeing pages. [[email protected]: fix buddy list helpers] Link: http://lkml.kernel.org/r/155033679702.1773410.13041474192173212653.stgit@dwillia2-desk3.amr.corp.intel.com [[email protected]: remove del_page_from_free_area() migratetype parameter] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/154899812264.3165233.5219320056406926223.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <[email protected]> Signed-off-by: Vlastimil Babka <[email protected]> Tested-by: Tetsuo Handa <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Kees Cook <[email protected]> Cc: Keith Busch <[email protected]> Cc: Robert Elliott <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14mm: shuffle initial free memory to improve memory-side-cache utilizationDan Williams9-2/+302
Patch series "mm: Randomize free memory", v10. This patch (of 3): Randomization of the page allocator improves the average utilization of a direct-mapped memory-side-cache. Memory side caching is a platform capability that Linux has been previously exposed to in HPC (high-performance computing) environments on specialty platforms. In that instance it was a smaller pool of high-bandwidth-memory relative to higher-capacity / lower-bandwidth DRAM. Now, this capability is going to be found on general purpose server platforms where DRAM is a cache in front of higher latency persistent memory [1]. Robert offered an explanation of the state of the art of Linux interactions with memory-side-caches [2], and I copy it here: It's been a problem in the HPC space: http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/ A kernel module called zonesort is available to try to help: https://software.intel.com/en-us/articles/xeon-phi-software and this abandoned patch series proposed that for the kernel: https://lkml.kernel.org/r/[email protected] Dan's patch series doesn't attempt to ensure buffers won't conflict, but also reduces the chance that the buffers will. This will make performance more consistent, albeit slower than "optimal" (which is near impossible to attain in a general-purpose kernel). That's better than forcing users to deploy remedies like: "To eliminate this gradual degradation, we have added a Stream measurement to the Node Health Check that follows each job; nodes are rebooted whenever their measured memory bandwidth falls below 300 GB/s." A replacement for zonesort was merged upstream in commit cc9aec03e58f ("x86/numa_emulation: Introduce uniform split capability"). With this numa_emulation capability, memory can be split into cache sized ("near-memory" sized) numa nodes. A bind operation to such a node, and disabling workloads on other nodes, enables full cache performance. However, once the workload exceeds the cache size then cache conflicts are unavoidable. While HPC environments might be able to tolerate time-scheduling of cache sized workloads, for general purpose server platforms, the oversubscribed cache case will be the common case. The worst case scenario is that a server system owner benchmarks a workload at boot with an un-contended cache only to see that performance degrade over time, even below the average cache performance due to excessive conflicts. Randomization clips the peaks and fills in the valleys of cache utilization to yield steady average performance. Here are some performance impact details of the patches: 1/ An Intel internal synthetic memory bandwidth measurement tool, saw a 3X speedup in a contrived case that tries to force cache conflicts. The contrived cased used the numa_emulation capability to force an instance of the benchmark to be run in two of the near-memory sized numa nodes. If both instances were placed on the same emulated they would fit and cause zero conflicts. While on separate emulated nodes without randomization they underutilized the cache and conflicted unnecessarily due to the in-order allocation per node. 2/ A well known Java server application benchmark was run with a heap size that exceeded cache size by 3X. The cache conflict rate was 8% for the first run and degraded to 21% after page allocator aging. With randomization enabled the rate levelled out at 11%. 3/ A MongoDB workload did not observe measurable difference in cache-conflict rates, but the overall throughput dropped by 7% with randomization in one case. 4/ Mel Gorman ran his suite of performance workloads with randomization enabled on platforms without a memory-side-cache and saw a mix of some improvements and some losses [3]. While there is potentially significant improvement for applications that depend on low latency access across a wide working-set, the performance may be negligible to negative for other workloads. For this reason the shuffle capability defaults to off unless a direct-mapped memory-side-cache is detected. Even then, the page_alloc.shuffle=0 parameter can be specified to disable the randomization on those systems. Outside of memory-side-cache utilization concerns there is potentially security benefit from randomization. Some data exfiltration and return-oriented-programming attacks rely on the ability to infer the location of sensitive data objects. The kernel page allocator, especially early in system boot, has predictable first-in-first out behavior for physical pages. Pages are freed in physical address order when first onlined. Quoting Kees: "While we already have a base-address randomization (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and memory layouts would certainly be using the predictability of allocation ordering (i.e. for attacks where the base address isn't important: only the relative positions between allocated memory). This is common in lots of heap-style attacks. They try to gain control over ordering by spraying allocations, etc. I'd really like to see this because it gives us something similar to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator." While SLAB_FREELIST_RANDOM reduces the predictability of some local slab caches it leaves vast bulk of memory to be predictably in order allocated. However, it should be noted, the concrete security benefits are hard to quantify, and no known CVE is mitigated by this randomization. Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform a Fisher-Yates shuffle of the page allocator 'free_area' lists when they are initially populated with free memory at boot and at hotplug time. Do this based on either the presence of a page_alloc.shuffle=Y command line parameter, or autodetection of a memory-side-cache (to be added in a follow-on patch). The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free pages where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1 i.e. 10, 4MB this trades off randomization granularity for time spent shuffling. MAX_ORDER-1 was chosen to be minimally invasive to the page allocator while still showing memory-side cache behavior improvements, and the expectation that the security implications of finer granularity randomization is mitigated by CONFIG_SLAB_FREELIST_RANDOM. The performance impact of the shuffling appears to be in the noise compared to other memory initialization work. This initial randomization can be undone over time so a follow-on patch is introduced to inject entropy on page free decisions. It is reasonable to ask if the page free entropy is sufficient, but it is not enough due to the in-order initial freeing of pages. At the start of that process putting page1 in front or behind page0 still keeps them close together, page2 is still near page1 and has a high chance of being adjacent. As more pages are added ordering diversity improves, but there is still high page locality for the low address pages and this leads to no significant impact to the cache conflict rate. [1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/ [2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM [3]: https://lkml.org/lkml/2018/10/12/309 [[email protected]: fix shuffle enable] Link: http://lkml.kernel.org/r/154943713038.3858443.4125180191382062871.stgit@dwillia2-desk3.amr.corp.intel.com [[email protected]: fix SHUFFLE_PAGE_ALLOCATOR help texts] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <[email protected]> Signed-off-by: Qian Cai <[email protected]> Reviewed-by: Kees Cook <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Keith Busch <[email protected]> Cc: Robert Elliott <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14mm/vmalloc.c: convert vmap_lazy_nr to atomic_long_tUladzislau Rezki (Sony)1-10/+10
vmap_lazy_nr variable has atomic_t type that is 4 bytes integer value on both 32 and 64 bit systems. lazy_max_pages() deals with "unsigned long" that is 8 bytes on 64 bit system, thus vmap_lazy_nr should be 8 bytes on 64 bit as well. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Reviewed-by: William Kucharski <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Thomas Garnier <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Cc: Joel Fernandes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14mm/vmalloc.c: add priority threshold to __purge_vmap_area_lazy()Uladzislau Rezki (Sony)1-6/+12
Commit 763b218ddfaf ("mm: add preempt points into __purge_vmap_area_lazy()") introduced some preempt points, one of those is making an allocation more prioritized over lazy free of vmap areas. Prioritizing an allocation over freeing does not work well all the time, i.e. it should be rather a compromise. 1) Number of lazy pages directly influences the busy list length thus on operations like: allocation, lookup, unmap, remove, etc. 2) Under heavy stress of vmalloc subsystem I run into a situation when memory usage gets increased hitting out_of_memory -> panic state due to completely blocking of logic that frees vmap areas in the __purge_vmap_area_lazy() function. Establish a threshold passing which the freeing is prioritized back over allocation creating a balance between each other. Using vmalloc test driver in "stress mode", i.e. When all available test cases are run simultaneously on all online CPUs applying a pressure on the vmalloc subsystem, my HiKey 960 board runs out of memory due to the fact that __purge_vmap_area_lazy() logic simply is not able to free pages in time. How I run it: 1) You should build your kernel with CONFIG_TEST_VMALLOC=m 2) ./tools/testing/selftests/vm/test_vmalloc.sh stress During this test "vmap_lazy_nr" pages will go far beyond acceptable lazy_max_pages() threshold, that will lead to enormous busy list size and other problems including allocation time and so on. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Thomas Garnier <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Joel Fernandes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14kernel/sched/psi.c: expose pressure metrics on root cgroupDan Schatzberg3-7/+14
Pressure metrics are already recorded and exposed in procfs for the entire system, but any tool which monitors cgroup pressure has to special case the root cgroup to read from procfs. This patch exposes the already recorded pressure metrics on the root cgroup. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Dan Schatzberg <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Li Zefan <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14psi: introduce psi monitorSuren Baghdasaryan5-20/+742
Psi monitor aims to provide a low-latency short-term pressure detection mechanism configurable by users. It allows users to monitor psi metrics growth and trigger events whenever a metric raises above user-defined threshold within user-defined time window. Time window and threshold are both expressed in usecs. Multiple psi resources with different thresholds and window sizes can be monitored concurrently. Psi monitors activate when system enters stall state for the monitored psi metric and deactivate upon exit from the stall state. While system is in the stall state psi signal growth is monitored at a rate of 10 times per tracking window. Min window size is 500ms, therefore the min monitoring interval is 50ms. Max window size is 10s with monitoring interval of 1s. When activated psi monitor stays active for at least the duration of one tracking window to avoid repeated activations/deactivations when psi signal is bouncing. Notifications to the users are rate-limited to one per tracking window. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Signed-off-by: Johannes Weiner <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Li Zefan <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14include/: refactor headers to allow kthread.h inclusion in psi_types.hSuren Baghdasaryan4-2/+4
kthread.h can't be included in psi_types.h because it creates a circular inclusion with kthread.h eventually including psi_types.h and complaining on kthread structures not being defined because they are defined further in the kthread.h. Resolve this by removing psi_types.h inclusion from the headers included from kthread.h. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Li Zefan <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14psi: track changed statesSuren Baghdasaryan1-6/+18
Introduce changed_states parameter into collect_percpu_times to track the states changed since the last update. This will be needed to detect whether polled states activated in the monitor patch. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Li Zefan <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14psi: split update_stats into partsSuren Baghdasaryan1-23/+34
Split update_stats into collect_percpu_times and update_averages for collect_percpu_times to be reused later inside psi monitor. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Li Zefan <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14psi: rename psi fields in preparation for psi trigger additionSuren Baghdasaryan2-27/+28
Rename psi_group structure member fields used for calculating psi totals and averages for clear distinction between them and for trigger-related fields that will be added by "psi: introduce psi monitor". [[email protected]: v6] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Li Zefan <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14psi: make psi_enable staticSuren Baghdasaryan1-2/+2
psi_enable is not used outside of psi.c, make it static. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Suggested-by: Andrew Morton <[email protected]> Acked-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14psi: introduce state_mask to represent stalled psi statesSuren Baghdasaryan2-13/+25
Patch series "psi: pressure stall monitors", v6. This is a respin of: https://lwn.net/ml/linux-kernel/20190308184311.144521-1-surenb%40google.com/ Android is adopting psi to detect and remedy memory pressure that results in stuttering and decreased responsiveness on mobile devices. Psi gives us the stall information, but because we're dealing with latencies in the millisecond range, periodically reading the pressure files to detect stalls in a timely fashion is not feasible. Psi also doesn't aggregate its averages at a high-enough frequency right now. This patch series extends the psi interface such that users can configure sensitive latency thresholds and use poll() and friends to be notified when these are breached. As high-frequency aggregation is costly, it implements an aggregation method that is optimized for fast, short-interval averaging, and makes the aggregation frequency adaptive, such that high-frequency updates only happen while monitored stall events are actively occurring. With these patches applied, Android can monitor for, and ward off, mounting memory shortages before they cause problems for the user. For example, using memory stall monitors in userspace low memory killer daemon (lmkd) we can detect mounting pressure and kill less important processes before device becomes visibly sluggish. In our memory stress testing psi memory monitors produce roughly 10x less false positives compared to vmpressure signals. Having ability to specify multiple triggers for the same psi metric allows other parts of Android framework to monitor memory state of the device and act accordingly. The new interface is straight-forward. The user opens one of the pressure files for writing and writes a trigger description into the file descriptor that defines the stall state - some or full, and the maximum stall time over a given window of time. E.g.: /* Signal when stall time exceeds 100ms of a 1s window */ char trigger[] = "full 100000 1000000" fd = open("/proc/pressure/memory") write(fd, trigger, sizeof(trigger)) while (poll() >= 0) { ... }; close(fd); When the monitored stall state is entered, psi adapts its aggregation frequency according to what the configured time window requires in order to emit event signals in a timely fashion. Once the stalling subsides, aggregation reverts back to normal. The trigger is associated with the open file descriptor. To stop monitoring, the user only needs to close the file descriptor and the trigger is discarded. Patches 1-6 prepare the psi code for polling support. Patch 7 implements the adaptive polling logic, the pressure growth detection optimized for short intervals, and hooks up write() and poll() on the pressure files. The patches were developed in collaboration with Johannes Weiner. This patch (of 7): The psi monitoring patches will need to determine the same states as record_times(). To avoid calculating them twice, maintain a state mask that can be consulted cheaply. Do this in a separate patch to keep the churn in the main feature patch at a minimum. This adds 4-byte state_mask member into psi_group_cpu struct which results in its first cacheline-aligned part becoming 52 bytes long. Add explicit values to enumeration element counters that affect psi_group_cpu struct size. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Li Zefan <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14mm: update references to page _refcountBaruch Siach2-2/+2
Commit 0139aa7b7fa ("mm: rename _count, field of the struct page, to _refcount") left out a couple of references to the old field name. Fix that. Link: http://lkml.kernel.org/r/cedf87b02eb8a6b3eac57e8e91da53fb15c3c44c.1556537475.git.baruch@tkos.co.il Fixes: 0139aa7b7fa ("mm: rename _count, field of the struct page, to _refcount") Signed-off-by: Baruch Siach <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14mm: change mm_update_next_owner() to update mm->owner with WRITE_ONCEAndrea Arcangeli1-3/+3
The RCU reader uses rcu_dereference() inside rcu_read_lock critical sections, so the writer shall use WRITE_ONCE. Just a cleanup, we still rely on gcc to emit atomic writes in other places. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrea Arcangeli <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jann Horn <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Peter Xu <[email protected]> Cc: zhong jiang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14userfaultfd: use RCU to free the task struct when fork failsAndrea Arcangeli1-2/+29
The task structure is freed while get_mem_cgroup_from_mm() holds rcu_read_lock() and dereferences mm->owner. get_mem_cgroup_from_mm() failing fork() ---- --- task = mm->owner mm->owner = NULL; free(task) if (task) *task; /* use after free */ The fix consists in freeing the task with RCU also in the fork failure case, exactly like it always happens for the regular exit(2) path. That is enough to make the rcu_read_lock hold in get_mem_cgroup_from_mm() (left side above) effective to avoid a use after free when dereferencing the task structure. An alternate possible fix would be to defer the delivery of the userfaultfd contexts to the monitor until after fork() is guaranteed to succeed. Such a change would require more changes because it would create a strict ordering dependency where the uffd methods would need to be called beyond the last potentially failing branch in order to be safe. This solution as opposed only adds the dependency to common code to set mm->owner to NULL and to free the task struct that was pointed by mm->owner with RCU, if fork ends up failing. The userfaultfd methods can still be called anywhere during the fork runtime and the monitor will keep discarding orphaned "mm" coming from failed forks in userland. This race condition couldn't trigger if CONFIG_MEMCG was set =n at build time. [[email protected]: improve changelog, reduce #ifdefs per Michal] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Fixes: 893e26e61d04 ("userfaultfd: non-cooperative: Add fork() event") Signed-off-by: Andrea Arcangeli <[email protected]> Tested-by: zhong jiang <[email protected]> Reported-by: [email protected] Cc: Oleg Nesterov <[email protected]> Cc: Jann Horn <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Peter Xu <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: zhong jiang <[email protected]> Cc: [email protected] Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14kernel/Makefile: don't assume that kernel/gen_ikh_data.sh is executableAndrew Morton1-1/+1
If the user downloads and applies patch-5.1.gz using patch(1), the x bit on kernel/gen_ikh_data.sh is not set. /bin/sh: 1: ./kernel/gen_ikh_data.sh: Permission denied Fix this by using CONFIG_SHELL. Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14io_uring: fix failure to verify SQ_AFF cpuJens Axboe1-2/+3
The test case we have is rightfully failing with the current kernel: io_uring_setup(1, 0x7ffe2cafebe0), flags: IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF, resv: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000, sq_thread_cpu: 4 expected -1, got 3 This is in a vm, and CPU3 is the last valid one, hence asking for 4 should fail the setup with -EINVAL, not succeed. The problem is that we're using array_index_nospec() with nr_cpu_ids as the index, hence we wrap and end up using CPU0 instead of CPU4. This makes the setup succeed where it should be failing. We don't need to use array_index_nospec() as we're not indexing any array with this. Instead just compare with nr_cpu_ids directly. This is fine as we're checking with cpu_online() afterwards. Signed-off-by: Jens Axboe <[email protected]>
2019-05-15powerpc/mm: Fix crashes with hugepages & 4K pagesMichael Ellerman1-1/+1
The recent commit to cleanup ifdefs in the hugepage initialisation led to crashes when using 4K pages as reported by Sachin: BUG: Kernel NULL pointer dereference at 0x0000001c Faulting instruction address: 0xc000000001d1e58c Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=4K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries ... CPU: 3 PID: 4635 Comm: futex_wake04 Tainted: G W O 5.1.0-next-20190507-autotest #1 NIP: c000000001d1e58c LR: c000000001d1e54c CTR: 0000000000000000 REGS: c000000004937890 TRAP: 0300 MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 22424822 XER: 00000000 CFAR: c00000000183e9e0 DAR: 000000000000001c DSISR: 40000000 IRQMASK: 0 ... NIP kmem_cache_alloc+0xbc/0x5a0 LR kmem_cache_alloc+0x7c/0x5a0 Call Trace: huge_pte_alloc+0x580/0x950 hugetlb_fault+0x9a0/0x1250 handle_mm_fault+0x490/0x4a0 __do_page_fault+0x77c/0x1f00 do_page_fault+0x28/0x50 handle_page_fault+0x18/0x38 This is caused by us trying to allocate from a NULL kmem cache in __hugepte_alloc(). The kmem cache is NULL because it was never allocated in hugetlbpage_init(), because add_huge_page_size() returned an error. The reason add_huge_page_size() returned an error is a simple typo, we are calling check_and_get_huge_psize(size) when we should be passing shift instead. The fact that we're able to trigger this path when the kmem caches are NULL is a separate bug, ie. we should not advertise any hugepage sizes if we haven't setup the required caches for them. This was only seen with 4K pages, with 64K pages we don't need to allocate any extra kmem caches because the 16M hugepage just occupies a single entry at the PMD level. Fixes: 723f268f19da ("powerpc/mm: cleanup ifdef mess in add_huge_page_size()") Reported-by: Sachin Sant <[email protected]> Tested-by: Sachin Sant <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Reviewed-by: Christophe Leroy <[email protected]> Reviewed-by: Aneesh Kumar K.V <[email protected]>
2019-05-14selftests: avoid KBUILD_OUTPUT dir cluttering with selftest objectsShuah Khan1-1/+4
Running "make kselftest" or building selftests when KBUILD_OUTPUT is set, will create selftest objects in the KBUILD_OUTPUT directory. This could be undesirable especially when user didn't intend to relocate selftest objects. Use KBUILD_OUTPUT/kselftest to create selftest objects instead of cluttering the main directory. Signed-off-by: Shuah Khan <[email protected]>
2019-05-14selftests: drivers: Create .gitignore to include /dma-buf/udmabufKelsey Skunberg1-0/+1
Create ../selftests/drivers/.gitignore which holds the following file name created after compiling: - /dma-buf/udmabuf Signed-off-by: Kelsey Skunberg <[email protected]> Signed-off-by: Shuah Khan <[email protected]>
2019-05-14selftests: pidfd: Create .gitignore to include pidfd_testKelsey Skunberg1-0/+1
Create ../selftests/pidfd/.gitignore which holds the following file name created after compiling: - pidfd_test Signed-off-by: Kelsey Skunberg <[email protected]> Signed-off-by: Shuah Khan <[email protected]>
2019-05-15drm/pl111: Initialize clock spinlock earlyGuenter Roeck1-2/+3
The following warning is seen on systems with broken clock divider. INFO: trying to register non-static key. the code is fine but needs lockdep annotation. turning off the locking correctness validator. CPU: 0 PID: 1 Comm: swapper Not tainted 5.1.0-09698-g1fb3b52 #1 Hardware name: ARM Integrator/CP (Device Tree) [<c0011be8>] (unwind_backtrace) from [<c000ebb8>] (show_stack+0x10/0x18) [<c000ebb8>] (show_stack) from [<c07d3fd0>] (dump_stack+0x18/0x24) [<c07d3fd0>] (dump_stack) from [<c0060d48>] (register_lock_class+0x674/0x6f8) [<c0060d48>] (register_lock_class) from [<c005de2c>] (__lock_acquire+0x68/0x2128) [<c005de2c>] (__lock_acquire) from [<c0060408>] (lock_acquire+0x110/0x21c) [<c0060408>] (lock_acquire) from [<c07f755c>] (_raw_spin_lock+0x34/0x48) [<c07f755c>] (_raw_spin_lock) from [<c0536c8c>] (pl111_display_enable+0xf8/0x5fc) [<c0536c8c>] (pl111_display_enable) from [<c0502f54>] (drm_atomic_helper_commit_modeset_enables+0x1ec/0x244) Since commit eedd6033b4c8 ("drm/pl111: Support variants with broken clock divider"), the spinlock is not initialized if the clock divider is broken. Initialize it earlier to fix the problem. Fixes: eedd6033b4c8 ("drm/pl111: Support variants with broken clock divider") Cc: Linus Walleij <[email protected]> Signed-off-by: Guenter Roeck <[email protected]> Signed-off-by: Linus Walleij <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2019-05-14cifs:smbd Use the correct DMA direction when sending dataLong Li1-3/+5
When sending data, use the DMA_TO_DEVICE to map buffers. Also log the number of requests in a compounding request from upper layer. Signed-off-by: Long Li <[email protected]> Signed-off-by: Steve French <[email protected]> Acked-by: Pavel Shilovsky <[email protected]> Acked-by: Ronnie Sahlberg <[email protected]>
2019-05-14cifs:smbd When reconnecting to server, call smbd_destroy() after all MIDs ↵Long Li1-17/+20
have been called commit 214bab448476 ("cifs: Call MID callback before destroying transport") assumes that the MID callback should not take srv_mutex, this may not always be true. SMB Direct requires the MID callback completed before calling transport so all pending memory registration can be freed. So restore the original calling sequence so TCP transport will use the same code, but moving smbd_destroy() after all MID has been called. fixes: 214bab448476 ("cifs: Call MID callback before destroying transport") Signed-off-by: Long Li <[email protected]> Signed-off-by: Steve French <[email protected]> Reviewed-by: Pavel Shilovsky <[email protected]>
2019-05-14power: supply: olpc_battery: force the le/be castsLubomir Rintel1-3/+3
The endianness of data returned from the EC depends on the particular EC version determined at run time. Cast from little/big endian explicitey in the routine that flips endianness to the native one to make sparse happy. Signed-off-by: Lubomir Rintel <[email protected]> Reported-by: kbuild test robot <[email protected]> Fixes: 76311b9a3295 ("power: supply: olpc_battery: Add OLPC XO 1.75 support") Signed-off-by: Sebastian Reichel <[email protected]>
2019-05-14Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds5-59/+40
Pull virtio updates from Michael Tsirkin: - enable packed ring support for s390 - several fixes * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: virtio/s390: enable packed ring virtio/s390: DMA support for virtio-ccw virtio/s390: use vring_create_virtqueue virtio/virtio_ring: do some comment fixes vhost-scsi: remove incorrect memory barrier tools/virtio/ringtest: Remove bogus definition of BUG_ON() virtio_ring: Fix potential mem leak in virtqueue_add_indirect_packed
2019-05-14Add gitignore file for samples/vfs/ generated filesLinus Torvalds1-0/+2
Commit f1b5618e013a ("vfs: Add a sample program for the new mount API") added sample programs that get built during the kernel build, but then cause 'git status' to worry about whether the resulting binaries should be managed by git. Tell git not to worry, and to ignore the sample binaries. Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14Merge branch 'parisc-5.2-2' of ↵Linus Torvalds15-72/+73
git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux Pull more parisc updates from Helge Deller: "Two small enhancements, which I didn't included in the last pull request because I wanted to keep them a few more days in for-next before sending upstream: - Replace the ldcw barrier instruction by a nop instruction in the CAS code on uniprocessor machines. - Map variables read-only after init (enable ro_after_init feature)" * 'parisc-5.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux: parisc: Use __ro_after_init in init.c parisc: Use __ro_after_init in unwind.c parisc: Use __ro_after_init in time.c parisc: Use __ro_after_init in processor.c parisc: Use __ro_after_init in process.c parisc: Use __ro_after_init in perf_images.h parisc: Use __ro_after_init in pci.c parisc: Use __ro_after_init in inventory.c parisc: Use __ro_after_init in head.S parisc: Use __ro_after_init in firmware.c parisc: Use __ro_after_init in drivers.c parisc: Use __ro_after_init in cache.c parisc: Enable the ro_after_init feature parisc: Drop LDCW barrier in CAS code when running UP
2019-05-14Merge tag 'kgdb-5.2-rc1' of ↵Linus Torvalds5-9/+8
git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux Pull kgdb updates from Daniel Thompson: "Mostly cleanups but there are also a couple of fixes for out-of-bounds accesses (including a potential write to the byte before a static buffer). The main changes are: - Fixes to those out-of-bounds access (empty string to configure test module could write the byte before a buffer, high cpu counts could read outside of per-cpu structures). - Improvements to string handling problems picked up by new compiler warnings and other static checks. Most are fixing benign issues that can't be tickled without code changes but still reduce the wtf factor a little. - Tidy up the terminal output" * tag 'kgdb-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux: kdb: Fix bound check compiler warning kdb: do a sanity check on the cpu in kdb_per_cpu() kdb: Get rid of broken attempt to print CCVERSION in kdb summary misc: kgdbts: fix out-of-bounds access in function param_set_kgdbts_var kdb: kdb_support: replace strcpy() by strscpy() gdbstub: Replace strcpy() by strscpy() gdbstub: mark expected switch fall-throughs
2019-05-14Merge tag 'modules-for-v5.2' of ↵Linus Torvalds4-9/+27
git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux Pull modules updates from Jessica Yu: - Use a separate table to store symbol types instead of hijacking fields in struct Elf_Sym - Trivial code cleanups * tag 'modules-for-v5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux: module: add stubs for within_module functions kallsyms: store type information in its own array vmlinux.lds.h: drop unused __vermagic
2019-05-14drm/msm: correct attempted NULL pointer dereference in debugfsBrian Masney1-1/+2
msm_gem_describe() would attempt to dereference a NULL pointer via the address space pointer when no IOMMU is present. Correct this by adding the appropriate check. Signed-off-by: Brian Masney <[email protected]> Fixes: 575f0485508b ("drm/msm: Clean up and enhance the output of the 'gem' debugfs node") Signed-off-by: Sean Paul <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2019-05-14Merge tag 'backlight-next-5.2' of ↵Linus Torvalds15-45/+293
git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight Pull backlight updates from Lee Jones: "Fix-ups: - Remove unused BACKLIGHT_LCD_SUPPORT symbol - Remove unused BACKLIGHT_CLASS_DEVICE dependencies - Add DT support to lm3630a_bl Bug Fixes: - Fix error path issues in lm3630a_bl" * tag 'backlight-next-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight: backlight: lm3630a: Add firmware node support dt-bindings: backlight: Add lm3630a bindings backlight: lm3630a: Return 0 on success in update_status functions video: lcd: Remove useless BACKLIGHT_CLASS_DEVICE dependencies video: backlight: Remove useless BACKLIGHT_LCD_SUPPORT kernel symbol
2019-05-14Merge tag 'mfd-next-5.2' of ↵Linus Torvalds79-219/+3745
git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd Pull MFD updates from Lee Jones: "Core Framework: - Document (kerneldoc) core mfd_add_devices() API New Drivers: - Altera SOCFPGA System Manager - Maxim MAX77650/77651 PMIC - Maxim MAX77663 PMIC - ST Multi-Function eXpander (STMFX) New Device Support: - LEDs support in Intel Cherry Trail Whiskey Cove PMIC - RTC support in SAMSUNG Electronics S2MPA01 PMIC - SAM9X60 support in Atmel HLCDC (High-end LCD Controller) - USB X-Powers AXP 8xx PMICs - Integrated Sensor Hub (ISH) in ChromeOS EC - USB PD Logger in ChromeOS EC - AXP223 in X-Powers AXP series PMICs - Power Supply in X-Powers AXP 803 PMICs - Comet Lake in Intel Low Power Subsystem - Fingerprint MCU in ChromeOS EC - Touchpad MCU in ChromeOS EC - Move TI LM3532 support to LED New Functionality: - max77650, max77620: Add/extend DT support - max77620 power-off - syscon clocking - croc_ec host sleep event Fix-ups: - Trivial; Formatting, spelling, etc; Kconfig, sec-core, ab8500-debugfs - Remove unused functionality; rk808, da9063-* - SPDX conversion; da9063-*, atmel-*, - Adapt/add new register definitions; cs47l35-tables, cs47l90-tables, imx6q-iomuxc-gpr - Fix-up DT bindings; ti-lmu, cirrus,lochnagar - Simply obtaining driver data; ssbi, t7l66xb, tc6387xb, tc6393xb Bug Fixes: - Fix incorrect defined values; max77620, da9063 - Fix device initialisation; twl6040 - Reset device on init; intel-lpss - Fix build warnings when !OF; sun6i-prcm - Register OF match tables; tps65912-spi - Fix DMI matching; intel_quark_i2c_gpio" * tag 'mfd-next-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (65 commits) mfd: Use dev_get_drvdata() directly mfd: cros_ec: Instantiate properly CrOS Touchpad MCU device mfd: cros_ec: Instantiate properly CrOS FP MCU device mfd: cros_ec: Update the EC feature codes mfd: intel-lpss: Add Intel Comet Lake PCI IDs mfd: lochnagar: Add links to binding docs for sound and hwmon mfd: ab8500-debugfs: Fix a typo ("deubgfs") mfd: imx6sx: Add MQS register definition for iomuxc gpr dt-bindings: mfd: LMU: Fix lm3632 dt binding example mfd: intel_quark_i2c_gpio: Adjust IOT2000 matching mfd: da9063: Fix OTP control register names to match datasheets for DA9063/63L mfd: tps65912-spi: Add missing of table registration mfd: axp20x: Add USB power supply mfd cell to AXP803 mfd: sun6i-prcm: Fix build warning for non-OF configurations mfd: intel-lpss: Set the device in reset state when init platform/chrome: Add support for v1 of host sleep event mfd: cros_ec: Add host_sleep_event_v1 command mfd: cros_ec: Instantiate the CrOS USB PD logger driver mfd: cs47l90: Make DAC_AEC_CONTROL_2 readable mfd: cs47l35: Make DAC_AEC_CONTROL_2 readable ...
2019-05-14Merge tag 'pci-v5.2-changes' of ↵Linus Torvalds92-1621/+2911
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI updates from Bjorn Helgaas: "Enumeration changes: - Add _HPX Type 3 settings support, which gives firmware more influence over device configuration (Alexandru Gagniuc) - Support fixed bus numbers from bridge Enhanced Allocation capabilities (Subbaraya Sundeep) - Add "external-facing" DT property to identify cases where we require IOMMU protection against untrusted devices (Jean-Philippe Brucker) - Enable PCIe services for host controller drivers that use managed host bridge alloc (Jean-Philippe Brucker) - Log PCIe port service messages with pci_dev, not the pcie_device (Frederick Lawler) - Convert pciehp from pciehp_debug module parameter to generic dynamic debug (Frederick Lawler) Peer-to-peer DMA: - Add whitelist of Root Complexes that support peer-to-peer DMA between Root Ports (Christian König) Native controller drivers: - Add PCI host bridge DMA ranges for bridges that can't DMA everywhere, e.g., iProc (Srinath Mannam) - Add Amazon Annapurna Labs PCIe host controller driver (Jonathan Chocron) - Fix Tegra MSI target allocation so DMA doesn't generate unwanted MSIs (Vidya Sagar) - Fix of_node reference leaks (Wen Yang) - Fix Hyper-V module unload & device removal issues (Dexuan Cui) - Cleanup R-Car driver (Marek Vasut) - Cleanup Keystone driver (Kishon Vijay Abraham I) - Cleanup i.MX6 driver (Andrey Smirnov) Significant bug fixes: - Reset Lenovo ThinkPad P50 GPU so nouveau works after reboot (Lyude Paul) - Fix Switchtec firmware update performance issue (Wesley Sheng) - Work around Pericom switch link retraining erratum (Stefan Mätje)" * tag 'pci-v5.2-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (141 commits) MAINTAINERS: Add Karthikeyan Mitran and Hou Zhiqiang for Mobiveil PCI PCI: pciehp: Remove pointless MY_NAME definition PCI: pciehp: Remove pointless PCIE_MODULE_NAME definition PCI: pciehp: Remove unused dbg/err/info/warn() wrappers PCI: pciehp: Log messages with pci_dev, not pcie_device PCI: pciehp: Replace pciehp_debug module param with dyndbg PCI: pciehp: Remove pciehp_debug uses PCI/AER: Log messages with pci_dev, not pcie_device PCI/DPC: Log messages with pci_dev, not pcie_device PCI/PME: Replace dev_printk(KERN_DEBUG) with dev_info() PCI/AER: Replace dev_printk(KERN_DEBUG) with dev_info() PCI: Replace dev_printk(KERN_DEBUG) with dev_info(), etc PCI: Replace printk(KERN_INFO) with pr_info(), etc PCI: Use dev_printk() when possible PCI: Cleanup setup-bus.c comments and whitespace PCI: imx6: Allow asynchronous probing PCI: dwc: Save root bus for driver remove hooks PCI: dwc: Use devm_pci_alloc_host_bridge() to simplify code PCI: dwc: Free MSI in dw_pcie_host_init() error path PCI: dwc: Free MSI IRQ page in dw_pcie_free_msi() ...
2019-05-14Merge branch 'akpm' (patches from Andrew)Linus Torvalds189-2332/+3646
Merge misc updates from Andrew Morton: - a few misc things and hotfixes - ocfs2 - almost all of MM * emailed patches from Andrew Morton <[email protected]>: (139 commits) kernel/memremap.c: remove the unused device_private_entry_fault() export mm: delete find_get_entries_tag mm/huge_memory.c: make __thp_get_unmapped_area static mm/mprotect.c: fix compilation warning because of unused 'mm' variable mm/page-writeback: introduce tracepoint for wait_on_page_writeback() mm/vmscan: simplify trace_reclaim_flags and trace_shrink_flags mm/Kconfig: update "Memory Model" help text mm/vmscan.c: don't disable irq again when count pgrefill for memcg mm: memblock: make keeping memblock memory opt-in rather than opt-out hugetlbfs: always use address space in inode for resv_map pointer mm/z3fold.c: support page migration mm/z3fold.c: add structure for buddy handles mm/z3fold.c: improve compression by extending search mm/z3fold.c: introduce helper functions mm/page_alloc.c: remove unnecessary parameter in rmqueue_pcplist mm/hmm: add ARCH_HAS_HMM_MIRROR ARCH_HAS_HMM_DEVICE Kconfig mm/vmscan.c: simplify shrink_inactive_list() fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback xen/privcmd-buf.c: convert to use vm_map_pages_zero() xen/gntdev.c: convert to use vm_map_pages() ...
2019-05-14kernel/memremap.c: remove the unused device_private_entry_fault() exportChristoph Hellwig1-1/+0
This export has been entirely unused since it was added more than 1 1/2 years ago. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Dan Williams <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14mm: delete find_get_entries_tagMatthew Wilcox (Oracle)2-64/+0
I removed the only user of this and hadn't noticed it was now unused. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Ross Zwisler <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14mm/huge_memory.c: make __thp_get_unmapped_area staticBharath Vedartham1-1/+1
__thp_get_unmapped_area is only used in mm/huge_memory.c. Make it static. Tested by building and booting the kernel. Link: http://lkml.kernel.org/r/20190504102353.GA22525@bharath12345-Inspiron-5559 Signed-off-by: Bharath Vedartham <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14mm/mprotect.c: fix compilation warning because of unused 'mm' variableMike Rapoport1-3/+2
Since 0cbe3e26abe0 ("mm: update ptep_modify_prot_start/commit to take vm_area_struct as arg") the only place that uses the local 'mm' variable in change_pte_range() is the call to set_pte_at(). Many architectures define set_pte_at() as macro that does not use the 'mm' parameter, which generates the following compilation warning: CC mm/mprotect.o mm/mprotect.c: In function 'change_pte_range': mm/mprotect.c:42:20: warning: unused variable 'mm' [-Wunused-variable] struct mm_struct *mm = vma->vm_mm; ^~ Fix it by passing vma->mm to set_pte_at() and dropping the local 'mm' variable in change_pte_range(). [[email protected]: fix missed conversions] Link: http://lkml.kernel.org/r/CAPhsuW6wcQgYLHNdBdw6m0YiR4RWsS4XzfpSKU7wBLLeOCTbpw@mail.gmail.comLink: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Song Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14mm/page-writeback: introduce tracepoint for wait_on_page_writeback()Yafang Shao3-10/+28
Recently there have been some hung tasks on our server due to wait_on_page_writeback(), and we want to know the details of this PG_writeback, i.e. this page is writing back to which device. But it is not so convenient to get the details. I think it would be better to introduce a tracepoint for diagnosing the writeback details. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yafang Shao <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Jan Kara <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14mm/vmscan: simplify trace_reclaim_flags and trace_shrink_flagsYafang Shao1-12/+8
trace_reclaim_flags and trace_shrink_flags are almost the same. We can simplify them to avoid redundant code. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yafang Shao <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14mm/Kconfig: update "Memory Model" help textMike Rapoport1-25/+23
The help describing the memory model selection is outdated. It still says that SPARSEMEM is experimental and DISCONTIGMEM is a preferred over SPARSEMEM. Update the help text for the relevant options: * add a generic help for the "Memory Model" prompt * add description for FLATMEM * reduce the description of DISCONTIGMEM and add a deprecation note * prefer SPARSEMEM over DISCONTIGMEM Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14mm/vmscan.c: don't disable irq again when count pgrefill for memcgYafang Shao1-1/+1
We can use __count_memcg_events() directly because this callsite is alreay protected by spin_lock_irq(). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yafang Shao <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14mm: memblock: make keeping memblock memory opt-in rather than opt-outMike Rapoport16-23/+19
Most architectures do not need the memblock memory after the page allocator is initialized, but only few enable ARCH_DISCARD_MEMBLOCK in the arch Kconfig. Replacing ARCH_DISCARD_MEMBLOCK with ARCH_KEEP_MEMBLOCK and inverting the logic makes it clear which architectures actually use memblock after system initialization and skips the necessity to add ARCH_DISCARD_MEMBLOCK to the architectures that are still missing that option. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Acked-by: Michael Ellerman <[email protected]> (powerpc) Cc: Russell King <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: Richard Kuo <[email protected]> Cc: Tony Luck <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Paul Burton <[email protected]> Cc: James Hogan <[email protected]> Cc: Ley Foon Tan <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Yoshinori Sato <[email protected]> Cc: Rich Felker <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Eric Biederman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14hugetlbfs: always use address space in inode for resv_map pointerMike Kravetz2-3/+27
Continuing discussion about 58b6e5e8f1ad ("hugetlbfs: fix memory leak for resv_map") brought up the issue that inode->i_mapping may not point to the address space embedded within the inode at inode eviction time. The hugetlbfs truncate routine handles this by explicitly using inode->i_data. However, code cleaning up the resv_map will still use the address space pointed to by inode->i_mapping. Luckily, private_data is NULL for address spaces in all such cases today but, there is no guarantee this will continue. Change all hugetlbfs code getting a resv_map pointer to explicitly get it from the address space embedded within the inode. In addition, add more comments in the code to indicate why this is being done. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Reported-by: Yufen Yu <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14mm/z3fold.c: support page migrationVitaly Wool1-10/+231
Now that we are not using page address in handles directly, we can make z3fold pages movable to decrease the memory fragmentation z3fold may create over time. This patch starts advertising non-headless z3fold pages as movable and uses the existing kernel infrastructure to implement moving of such pages per memory management subsystem's request. It thus implements 3 required callbacks for page migration: * isolation callback: z3fold_page_isolate(): try to isolate the page by removing it from all lists. Pages scheduled for some activity and mapped pages will not be isolated. Return true if isolation was successful or false otherwise * migration callback: z3fold_page_migrate(): re-check critical conditions and migrate page contents to the new page provided by the memory subsystem. Returns 0 on success or negative error code otherwise * putback callback: z3fold_page_putback(): put back the page if z3fold_page_migrate() for it failed permanently (i. e. not with -EAGAIN code). [[email protected]: z3fold_page_isolate() can be static] Link: http://lkml.kernel.org/r/20190419130924.GA161478@ivb42 Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vitaly Wool <[email protected]> Signed-off-by: kbuild test robot <[email protected]> Cc: Bartlomiej Zolnierkiewicz <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Krzysztof Kozlowski <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Cc: Uladzislau Rezki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14mm/z3fold.c: add structure for buddy handlesVitaly Wool1-40/+145
For z3fold to be able to move its pages per request of the memory subsystem, it should not use direct object addresses in handles. Instead, it will create abstract handles (3 per page) which will contain pointers to z3fold objects. Thus, it will be possible to change these pointers when z3fold page is moved. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vitaly Wool <[email protected]> Cc: Bartlomiej Zolnierkiewicz <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Krzysztof Kozlowski <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Cc: Uladzislau Rezki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14mm/z3fold.c: improve compression by extending searchVitaly Wool1-0/+36
The current z3fold implementation only searches this CPU's page lists for a fitting page to put a new object into. This patch adds quick search for very well fitting pages (i. e. those having exactly the required number of free space) on other CPUs too, before allocating a new page for that object. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vitaly Wool <[email protected]> Cc: Bartlomiej Zolnierkiewicz <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Krzysztof Kozlowski <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Cc: Uladzislau Rezki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14mm/z3fold.c: introduce helper functionsVitaly Wool1-84/+100
Patch series "z3fold: support page migration", v2. This patchset implements page migration support and slightly better buddy search. To implement page migration support, z3fold has to move away from the current scheme of handle encoding. i. e. stop encoding page address in handles. Instead, a small per-page structure is created which will contain actual addresses for z3fold objects, while pointers to fields of that structure will be used as handles. Thus, it will be possible to change the underlying addresses to reflect page migration. To support migration itself, 3 callbacks will be implemented: 1: isolation callback: z3fold_page_isolate(): try to isolate the page by removing it from all lists. Pages scheduled for some activity and mapped pages will not be isolated. Return true if isolation was successful or false otherwise 2: migration callback: z3fold_page_migrate(): re-check critical conditions and migrate page contents to the new page provided by the system. Returns 0 on success or negative error code otherwise 3: putback callback: z3fold_page_putback(): put back the page if z3fold_page_migrate() for it failed permanently (i. e. not with -EAGAIN code). To make sure an isolated page doesn't get freed, its kref is incremented in z3fold_page_isolate() and decremented during post-migration compaction, if migration was successful, or by z3fold_page_putback() in the other case. Since the new handle encoding scheme implies slight memory consumption increase, better buddy search (which decreases memory consumption) is included in this patchset. This patch (of 4): Introduce a separate helper function for object allocation, as well as 2 smaller helpers to add a buddy to the list and to get a pointer to the pool from the z3fold header. No functional changes here. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vitaly Wool <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Bartlomiej Zolnierkiewicz <[email protected]> Cc: Krzysztof Kozlowski <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Cc: Uladzislau Rezki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>