Age | Commit message | Author | Files changed | Lines (-/+)
2018-04-06  Merge tag 'for-4.17/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm  (Linus Torvalds; 27 files, -310/+465)

Pull device mapper updates from Mike Snitzer:

 - DM core passthrough ioctl fix to retain reference to DM table, and that table's block devices, while issuing the ioctl to one of those block devices.

 - DM core passthrough ioctl fix to _not_ override the fmode_t used to issue the ioctl. Overriding by using the fmode_t that the block device was originally open with during DM table load is a liability.

 - Add DM core support for secure erase forwarding and update the DM linear and DM striped targets to support them.

 - A DM core 4.16 stable fix to allow abnormal IO (e.g. discard, write same, write zeroes) for targets that make use of the non-splitting IO variant (as is done for multipath or thinp when layered directly on NVMe).

 - Allow DM targets to return a payload in response to a DM message that they are sent. This is useful for DM targets that would like to provide statistics data in response to DM messages.

 - Update DM bufio to support non-power-of-2 block sizes. Numerous other related changes prepare the DM bufio code for this support.

 - Fix DM crypt to use a bounded amount of memory across the entire system. This is to avoid OOM that can otherwise occur in response to certain pathological IO workloads (e.g. discarding a large DM crypt device).

 - Add a 'check_at_most_once' feature to the DM verity target to allow verity to be used on mobile devices that have very limited resources.

 - Fix the DM integrity target to fail early if a keyed algorithm (e.g. HMAC) is to be used but the key isn't set.

 - Add non-power-of-2 support to the DM unstripe target.

 - Eliminate the use of a Variable Length Array in the DM stripe target.

 - Update the DM log-writes target to record metadata (REQ_META flag).

 - DM raid fixes for its nosync status and some variable range issues.

* tag 'for-4.17/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (28 commits)
  dm: remove fmode_t argument from .prepare_ioctl hook
  dm: hold DM table for duration of ioctl rather than use blkdev_get
  dm raid: fix parse_raid_params() variable range issue
  dm verity: make verity_for_io_block static
  dm verity: add 'check_at_most_once' option to only validate hashes once
  dm bufio: don't embed a bio in the dm_buffer structure
  dm bufio: support non-power-of-two block sizes
  dm bufio: use slab cache for dm_buffer structure allocations
  dm bufio: reorder fields in dm_buffer structure
  dm bufio: relax alignment constraint on slab cache
  dm bufio: remove code that merges slab caches
  dm bufio: get rid of slab cache name allocations
  dm bufio: move dm-bufio.h to include/linux/
  dm bufio: delete outdated comment
  dm: add support for secure erase forwarding
  dm: backfill abnormal IO support to non-splitting IO submission
  dm raid: fix nosync status
  dm mpath: use DM_MAPIO_SUBMITTED instead of magic number 0 in process_queued_bios()
  dm stripe: get rid of a Variable Length Array (VLA)
  dm log writes: record metadata flag for better flags record
  ...
2018-04-06  Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs  (Linus Torvalds; 22 files, -65/+46)

Pull misc vfs updates from Al Viro:
 "Assorted stuff, including Christoph's I_DIRTY patches"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: move I_DIRTY_INODE to fs.h
  ubifs: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) call
  ntfs: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) call
  gfs2: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) calls
  fs: fold open_check_o_direct into do_dentry_open
  vfs: Replace stray non-ASCII homoglyph characters with their ASCII equivalents
  vfs: make sure struct filename->iname is word-aligned
  get rid of pointless includes of fs_struct.h
  [poll] annotate SAA6588_CMD_POLL users
2018-04-06net: phy: marvell: Enable interrupt function on LED2 pinEsben Haabendal1-2/+18
The LED2[2]/INTn pin on the Marvell 88E1318S, as well as the 88E1510/12/14/18, needs to be configured to be usable as an interrupt not only when WOL is enabled, but whenever we rely on interrupts from the PHY.

Signed-off-by: Esben Haabendal <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2018-04-06kvm: x86: fix a prototype warningPeng Hao1-1/+1
Make the function static to avoid a warning: no previous prototype for ‘vmx_enable_tdp’ Signed-off-by: Peng Hao <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-04-06Merge branch '100GbE' of ↵David S. Miller2-3/+5
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue Jeff Kirsher says: ==================== Intel Wired LAN Driver Updates 2018-04-06 This series contains a couple of fixes for the new ice driver. Wei Yongjun fixes the return error code for error case during init. Anirudh fixes the incorrect use of ARRAY_SIZE() in the ice ethtool code and fixed "for" loop calculations. ==================== Signed-off-by: David S. Miller <[email protected]>
2018-04-06ARM: sa1100/simpad: switch simpad CF to use gpiod APIsRussell King2-9/+14
Switch simpad's CF implementation to use the gpiod APIs. The inverted detection is handled using gpiolib's native inversion abilities. Signed-off-by: Russell King <[email protected]>
2018-04-06ARM: sa1100/shannon: convert to generic CF socketsRussell King6-109/+40
Convert shannon to use the generic CF socket support. Signed-off-by: Russell King <[email protected]>
2018-04-06ARM: sa1100/nanoengine: convert to generic CF socketsRussell King5-138/+23
Convert nanoengine to use the generic CF socket support. Makefile fix from Arnd Bergmann <[email protected]>. Signed-off-by: Russell King <[email protected]>
2018-04-06ice: Bug fixes in ethtool codeAnirudh Venkataramanan1-2/+2
1) Return correct size from ice_get_regs_len. 2) Fix incorrect use of ARRAY_SIZE in ice_get_regs. Fixes: fcea6f3da546 (ice: Add stats and ethtool support) Signed-off-by: Anirudh Venkataramanan <[email protected]> Tested-by: Tony Brelinski <[email protected]> Signed-off-by: Jeff Kirsher <[email protected]>
2018-04-06ice: Fix error return code in ice_init_hw()Wei Yongjun1-1/+3
Fix to return error code ICE_ERR_NO_MEMORY from the alloc error handling case instead of 0, as done elsewhere in this function. Fixes: dc49c7723676 ("ice: Get MAC/PHY/link info and scheduler topology") Signed-off-by: Wei Yongjun <[email protected]> Acked-by: Anirudh Venkataramanan <[email protected]> Tested-by: Tony Brelinski <[email protected]> Signed-off-by: Jeff Kirsher <[email protected]>
2018-04-06Merge remote-tracking branch 'lorenzo/pci/cadence' into nextBjorn Helgaas1-0/+1
* lorenzo/pci/cadence: MAINTAINERS: Add missing /drivers/pci/cadence directory entry
2018-04-06fscache: Maintain a catalogue of allocated cookiesDavid Howells5-119/+279
Maintain a catalogue of allocated cookies so that cookie collisions can be handled properly. For the moment, this just involves printing a warning and returning a NULL cookie to the caller of fscache_acquire_cookie(), but in future it might make sense to wait for the old cookie to finish being cleaned up. This requires the cookie key to be stored attached to the cookie so that we still have the key available if the netfs relinquishes the cookie. This is done by an earlier patch. The catalogue also renders redundant fscache_netfs_list (used for checking for duplicates), so that can be removed. Signed-off-by: David Howells <[email protected]> Acked-by: Anna Schumaker <[email protected]> Tested-by: Steve Dickson <[email protected]>
2018-04-06fscache: Pass object size in rather than calling back for itDavid Howells21-166/+127
Pass the object size in to fscache_acquire_cookie() and fscache_write_page() rather than the netfs providing a callback by which it can be received. This makes it easier to update the size of the object when a new page is written that extends the object. The current object size is also passed by fscache to the check_aux function, obviating the need to store it in the aux data. Signed-off-by: David Howells <[email protected]> Acked-by: Anna Schumaker <[email protected]> Tested-by: Steve Dickson <[email protected]>
2018-04-05mm,oom_reaper: check for MMF_OOM_SKIP before complainingTetsuo Handa1-1/+2
I got "oom_reaper: unable to reap pid:" messages when the victim thread was blocked inside free_pgtables() (which occurred after returning from unmap_vmas() and setting MMF_OOM_SKIP). We don't need to complain when exit_mmap() already set MMF_OOM_SKIP. Killed process 7558 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB oom_reaper: unable to reap pid:7558 (a.out) a.out D13272 7558 6931 0x00100084 Call Trace: schedule+0x2d/0x80 rwsem_down_write_failed+0x2bb/0x440 call_rwsem_down_write_failed+0x13/0x20 down_write+0x49/0x60 unlink_file_vma+0x28/0x50 free_pgtables+0x36/0x100 exit_mmap+0xbb/0x180 mmput+0x50/0x110 copy_process.part.41+0xb61/0x1fe0 _do_fork+0xe6/0x560 do_syscall_64+0x74/0x230 entry_SYSCALL_64_after_hwframe+0x42/0xb7 Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Tetsuo Handa <[email protected]> Acked-by: David Rientjes <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-05mm/ksm: fix interaction with THPClaudio Imbrenda1-0/+28
This patch fixes a corner case for KSM. When two pages belong or belonged to the same transparent hugepage, and they should be merged, KSM fails to split the page, and therefore no merging happens.

This bug can be reproduced by:
 * making sure ksm is running (disabling ksmtuned if necessary)
 * enabling transparent hugepages
 * allocating a THP-aligned 1-THP-sized buffer, e.g. on amd64: posix_memalign(&p, 1<<21, 1<<21)
 * filling it with the same values, e.g. memset(p, 42, 1<<21)
 * performing madvise to make it mergeable, e.g. madvise(p, 1<<21, MADV_MERGEABLE)
 * waiting for KSM to perform a few scans

The expected outcome is that all the pages get merged (1 shared and the rest sharing); the actual outcome is that no pages get merged (1 unshared and the rest volatile).

The reason for this behaviour is that we increase the reference count once for both pages we want to merge, but if they belong to the same hugepage (or compound page), the reference counter used in both cases is the one of the head of the compound page. This means that split_huge_page will find a value of the reference counter too high and will fail.

This patch solves the problem by testing whether the two pages to merge belong to the same hugepage when attempting to merge them. If so, the hugepage is split safely. This means that the hugepage is not split if not necessary.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Claudio Imbrenda <[email protected]>
Co-authored-by: Gerald Schaefer <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
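For reference, the reproducer steps above translate into roughly this userspace program (a sketch assuming a 2MB THP size on amd64; error handling omitted):

    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>

    #define THP_SIZE (1UL << 21)    /* 2MB transparent hugepage on amd64 */

    int main(void)
    {
            void *p;

            /* THP-aligned, 1-THP-sized buffer */
            if (posix_memalign(&p, THP_SIZE, THP_SIZE))
                    return 1;

            /* identical contents, so the subpages are merge candidates */
            memset(p, 42, THP_SIZE);

            /* mark it mergeable and give ksmd time to scan it a few times */
            madvise(p, THP_SIZE, MADV_MERGEABLE);
            for (;;)
                    sleep(1);
    }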
2018-04-05mm/memblock.c: cast constant ULLONG_MAX to phys_addr_tStefan Agner1-2/+2
This fixes a warning shown when phys_addr_t is a 32-bit int when compiling with clang:

    mm/memblock.c:927:15: warning: implicit conversion from 'unsigned long long'
      to 'phys_addr_t' (aka 'unsigned int') changes value from 18446744073709551615
      to 4294967295 [-Wconstant-conversion]
                    r->base : ULLONG_MAX;
                              ^~~~~~~~~~
    ./include/linux/kernel.h:30:21: note: expanded from macro 'ULLONG_MAX'
    #define ULLONG_MAX (~0ULL)
                        ^~~~~

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Stefan Agner <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
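The fix is essentially an explicit cast; a sketch of the pattern (the variable name is illustrative):

    /* Sketch: make the truncation explicit so 32-bit phys_addr_t builds are quiet */
    phys_addr_t r_end = r ? r->base : (phys_addr_t)ULLONG_MAX;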
2018-04-05headers: untangle kmemleak.h from mm.hRandy Dunlap23-14/+11
Currently <linux/slab.h> #includes <linux/kmemleak.h> for no obvious reason. It looks like it's only a convenience, so remove kmemleak.h from slab.h and add <linux/kmemleak.h> to any users of kmemleak_* that don't already #include it. Also remove <linux/kmemleak.h> from source files that do not use it. This is tested on i386 allmodconfig and x86_64 allmodconfig. It would be good to run it through the 0day bot for other $ARCHes. I have neither the horsepower nor the storage space for the other $ARCHes. Update: This patch has been extensively build-tested by both the 0day bot & kisskb/ozlabs build farms. Both of them reported 2 build failures for which patches are included here (in v2). [ slab.h is the second most used header file after module.h; kernel.h is right there with slab.h. There could be some minor error in the counting due to some #includes having comments after them and I didn't combine all of those. ] [[email protected]: security/keys/big_key.c needs vmalloc.h, per sfr] Link: http://lkml.kernel.org/r/[email protected] Link: http://kisskb.ellerman.id.au/kisskb/head/13396/ Signed-off-by: Randy Dunlap <[email protected]> Reviewed-by: Ingo Molnar <[email protected]> Reported-by: Michael Ellerman <[email protected]> [2 build failures] Reported-by: Fengguang Wu <[email protected]> [2 build failures] Reviewed-by: Andrew Morton <[email protected]> Cc: Wei Yongjun <[email protected]> Cc: Luis R. Rodriguez <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Mimi Zohar <[email protected]> Cc: John Johansen <[email protected]> Cc: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-05include/linux/mmdebug.h: make VM_WARN* non-rvalsMichal Hocko1-4/+4
At present the construct if (VM_WARN(...)) will compile OK with CONFIG_DEBUG_VM=y and will fail with CONFIG_DEBUG_VM=n. The reason is that VM_{WARN,BUG}* have always been special wrt. {WARN/BUG}* and never generate any code when DEBUG_VM is disabled. So we cannot really use it in conditionals. We considered changing things so that this construct works in both cases but that might cause unwanted code generation with CONFIG_DEBUG_VM=n. It is safer and simpler to make the build fail in both cases. [[email protected]: changelog] Signed-off-by: Michal Hocko <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-05mm/page_isolation.c: make start_isolate_page_range() fail if already isolatedMike Kravetz2-5/+21
start_isolate_page_range() is used to set the migrate type of a set of pageblocks to MIGRATE_ISOLATE while attempting to start a migration operation. It assumes that only one thread is calling it for the specified range. This routine is used by CMA, memory hotplug and gigantic huge pages. Each of these users synchronize access to the range within their subsystem. However, two subsystems (CMA and gigantic huge pages for example) could attempt operations on the same range. If this happens, one thread may 'undo' the work another thread is doing. This can result in pageblocks being incorrectly left marked as MIGRATE_ISOLATE and therefore not available for page allocation. What is ideally needed is a way to synchronize access to a set of pageblocks that are undergoing isolation and migration. The only thing we know about these pageblocks is that they are all in the same zone. A per-node mutex is too coarse as we want to allow multiple operations on different ranges within the same zone concurrently. Instead, we will use the migration type of the pageblocks themselves as a form of synchronization. start_isolate_page_range sets the migration type on a set of page- blocks going in order from the one associated with the smallest pfn to the largest pfn. The zone lock is acquired to check and set the migration type. When going through the list of pageblocks check if MIGRATE_ISOLATE is already set. If so, this indicates another thread is working on this pageblock. We know exactly which pageblocks we set, so clean up by undo those and return -EBUSY. This allows start_isolate_page_range to serve as a synchronization mechanism and will allow for more general use of callers making use of these interfaces. Update comments in alloc_contig_range to reflect this new functionality. Each CPU holds the associated zone lock to modify or examine the migration type of a pageblock. And, it will only examine/update a single pageblock per lock acquire/release cycle. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Luiz Capitulino <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-05mm: change return type to vm_fault_tSouptick Joarder2-4/+45
The plan for these patches is to introduce the typedef, initially just as documentation ("These functions should return a VM_FAULT_ status"). We'll trickle the patches to individual drivers/filesystems in through the maintainers, as far as possible. Then we'll change the typedef to an unsigned int and break the compilation of any unconverted drivers/filesystems.

vmf_insert_page(), vmf_insert_mixed() and vmf_insert_pfn() are three newly added functions. The various drivers/filesystems where the return values of fault(), huge_fault(), page_mkwrite() and pfn_mkwrite() get converted will need them. These functions return the correct VM_FAULT_ code based on the err value.

We've had bugs before where drivers returned -EFOO. And we have this silly inefficiency where vm_insert_xxx() returns an errno which (afaict) every driver then converts into a VM_FAULT code. In many cases drivers failed to return the correct VM_FAULT code even when vm_insert_xxx() failed. We have identified and cleaned up all those existing bugs and silly inefficiencies in drivers/filesystems by adding these three new inline wrappers. As mentioned above, we will trickle those patches to individual drivers/filesystems in through maintainers after these three wrapper functions are merged.

Eventually we can convert vm_insert_xxx() into vmf_insert_xxx() and remove these inline wrappers, but these are a good intermediate step.

Link: http://lkml.kernel.org/r/20180310162351.GA7422@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
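A sketch of what one of these inline wrappers looks like (the precise errno-to-VM_FAULT_ mapping below is an assumption for illustration):

    /* Sketch: translate vm_insert_page()'s errno into a vm_fault_t */
    static inline vm_fault_t vmf_insert_page(struct vm_area_struct *vma,
                                             unsigned long addr, struct page *page)
    {
            int err = vm_insert_page(vma, addr, page);

            if (err == -ENOMEM)
                    return VM_FAULT_OOM;
            if (err < 0 && err != -EBUSY)
                    return VM_FAULT_SIGBUS;

            return VM_FAULT_NOPAGE;
    }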
2018-04-05mm, oom: remove 3% bonus for CAP_SYS_ADMIN processesDavid Rientjes1-7/+0
Since the 2.6 kernel, the oom killer has slightly biased away from CAP_SYS_ADMIN processes by discounting some of its memory usage in comparison to other processes. This has always been implicit and nothing exactly relies on the behavior. Gaurav notices that __task_cred() can dereference a potentially freed pointer if the task under consideration is exiting because a reference to the task_struct is not held. Remove the CAP_SYS_ADMIN bias so that all processes are treated equally. If any CAP_SYS_ADMIN process would like to be biased against, it is always allowed to adjust /proc/pid/oom_score_adj. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: David Rientjes <[email protected]> Reported-by: Gaurav Kohli <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Tetsuo Handa <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-05mm, page_alloc: wakeup kcompactd even if kswapd cannot free more memoryDavid Rientjes5-25/+45
Kswapd will not wake up if per-zone watermarks are not failing or if too many previous attempts at background reclaim have failed. This can be true if there is a lot of free memory available.

For high-order allocations, kswapd is responsible for waking up kcompactd for background compaction. If the zone is not below its watermarks or reclaim has recently failed (lots of free memory, nothing left to reclaim), kcompactd does not get woken up.

When __GFP_DIRECT_RECLAIM is not allowed, allow kcompactd to still be woken up even if kswapd will not reclaim. This allows high-order allocations, such as thp, to still trigger background compaction even when the zone has an abundance of free memory.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Rientjes <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2018-04-05kernel/fork.c: detect early free of a live mmMark Rutland2-0/+3
KASAN splats indicate that in some cases we free a live mm, then continue to access it, with potentially disastrous results. This is likely due to a mismatched mmdrop() somewhere in the kernel, but so far the culprit remains elusive. Let's have __mmdrop() verify that the mm isn't live for the current task, similar to the existing check for init_mm. This way, we can catch this class of issue earlier, and without requiring KASAN. Currently, idle_task_exit() leaves active_mm stale after it switches to init_mm. This isn't harmful, but will trigger the new assertions, so we must adjust idle_task_exit() to update active_mm. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mark Rutland <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-05mm: make counting of list_lru_one::nr_items locklessKirill Tkhai2-23/+47
When reclaiming slab of a memcg, shrink_slab iterates over all registered shrinkers in the system and tries to count and consume objects related to the cgroup. Under memory pressure, this behaves badly: I observe high system time and time spent in list_lru_count_one() for many processes on a RHEL7 kernel.

This patch makes list_lru_node::memcg_lrus rcu protected, which allows list_lru_count_one() to skip taking the spinlock.

Shakeel Butt observes a significant perf graph change with the patch. He says:

========================================================================
Setup: running a fork-bomb in a memcg of 200MiB on a 8GiB and 4 vcpu VM and recording the trace with 'perf record -g -a'.

The trace without the patch:

    + 34.19% fb.sh [kernel.kallsyms] [k] queued_spin_lock_slowpath
    + 30.77% fb.sh [kernel.kallsyms] [k] _raw_spin_lock
    +  3.53% fb.sh [kernel.kallsyms] [k] list_lru_count_one
    +  2.26% fb.sh [kernel.kallsyms] [k] super_cache_count
    +  1.68% fb.sh [kernel.kallsyms] [k] shrink_slab
    +  0.59% fb.sh [kernel.kallsyms] [k] down_read_trylock
    +  0.48% fb.sh [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
    +  0.38% fb.sh [kernel.kallsyms] [k] shrink_node_memcg
    +  0.32% fb.sh [kernel.kallsyms] [k] queue_work_on
    +  0.26% fb.sh [kernel.kallsyms] [k] count_shadow_nodes

With the patch:

    +  0.16% swapper     [kernel.kallsyms] [k] default_idle
    +  0.13% oom_reaper  [kernel.kallsyms] [k] mutex_spin_on_owner
    +  0.05% perf        [kernel.kallsyms] [k] copy_user_generic_string
    +  0.05% init.real   [kernel.kallsyms] [k] wait_consider_task
    +  0.05% kworker/0:0 [kernel.kallsyms] [k] finish_task_switch
    +  0.04% kworker/2:1 [kernel.kallsyms] [k] finish_task_switch
    +  0.04% kworker/3:1 [kernel.kallsyms] [k] finish_task_switch
    +  0.04% kworker/1:0 [kernel.kallsyms] [k] finish_task_switch
    +  0.03% binary      [kernel.kallsyms] [k] copy_page
========================================================================

Thanks Shakeel for the testing.

[[email protected]: v2]
Link: http://lkml.kernel.org/r/151203869520.3915.2587549826865799173.stgit@localhost.localdomain
Link: http://lkml.kernel.org/r/150583358557.26700.8490036563698102569.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <[email protected]>
Tested-by: Shakeel Butt <[email protected]>
Acked-by: Vladimir Davydov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2018-04-05mm/swap_state.c: make bool enable_vma_readahead and swap_vma_readahead() staticColin Ian King1-3/+3
The bool enable_vma_readahead and swap_vma_readahead() are local to the source and do not need to be in global scope, so make them static.

Cleans up sparse warnings:

    mm/swap_state.c:41:6: warning: symbol 'enable_vma_readahead' was not declared. Should it be static?
    mm/swap_state.c:742:13: warning: symbol 'swap_vma_readahead' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Colin Ian King <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Acked-by: "Huang, Ying" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2018-04-05block_invalidatepage(): only release page if the full page was invalidatedJeff Moyer1-1/+1
Prior to commit d47992f86b30 ("mm: change invalidatepage prototype to accept length"), an offset of 0 meant that the full page was being invalidated. After that commit, we need to instead check the length.

Jan said:

 "The only possible issue is that try_to_release_page() was called more often than necessary. Otherwise the issue is harmless but still it's good to have this fixed."

Link: http://lkml.kernel.org/r/[email protected]
Fixes: d47992f86b307 ("mm: change invalidatepage prototype to accept length")
Signed-off-by: Jeff Moyer <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Lukas Czerner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
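Read literally, the fixed release condition amounts to something like this sketch (the exact check in the merged patch may differ):

    /* Sketch: only drop buffers when the entire page is being invalidated */
    if (offset == 0 && length == PAGE_SIZE)
            try_to_release_page(page, 0);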
2018-04-05mm: kernel-doc: add missing parameter descriptionsMike Rapoport8-0/+30
Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-05mm/swap.c: remove @cold parameter description for release_pages()Mike Rapoport1-1/+0
The 'cold' parameter was removed from release_pages function by commit c6f92f9fbe7d ("mm: remove cold parameter for release_pages"). Update the description to match the code. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-05mm/nommu: remove description of alloc_vm_areaMike Rapoport1-12/+0
The alloc_vm_area() in nommu is a stub, but its description states it allocates kernel address space. Remove the description to make the code and the documentation agree.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2018-04-05zram: drop max_zpage_size and use zs_huge_class_size()Sergey Senozhatsky2-17/+8
Remove ZRAM's enforced "huge object" value and use zsmalloc huge-class watermark instead, which makes more sense. TEST - I used a 1G zram device, LZO compression back-end, original data set size was 444MB. Looking at zsmalloc classes stats the test ended up to be pretty fair. BASE ZRAM/ZSMALLOC ===================== zram mm_stat 498978816 191482495 199831552 0 199831552 15634 0 zsmalloc classes class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 151 2448 0 0 1240 1240 744 3 0 168 2720 0 0 4200 4200 2800 2 0 190 3072 0 0 10100 10100 7575 3 0 202 3264 0 0 380 380 304 4 0 254 4096 0 0 10620 10620 10620 1 0 Total 7 46 106982 106187 48787 0 PATCHED ZRAM/ZSMALLOC ===================== zram mm_stat 498978816 182579184 194248704 0 194248704 15628 0 zsmalloc classes class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable ... 151 2448 0 0 1240 1240 744 3 0 168 2720 0 0 4200 4200 2800 2 0 190 3072 0 0 10100 10100 7575 3 0 202 3264 0 0 7180 7180 5744 4 0 254 4096 0 0 3820 3820 3820 1 0 Total 8 45 106959 106193 47424 0 As we can see, we reduced the number of objects stored in class-4096, because a huge number of objects which we previously forcibly stored in class-4096 now stored in non-huge class-3264. This results in lower memory consumption: - zsmalloc now uses 47424 physical pages, which is less than 48787 pages zsmalloc used before. - objects that we store in class-3264 share zspages. That's why overall the number of pages that both class-4096 and class-3264 consumed went down from 10924 to 9564. [[email protected]: add pool param to zs_huge_class_size()] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-05zsmalloc: introduce zs_huge_class_size()Sergey Senozhatsky2-0/+43
Patch series "zsmalloc/zram: drop zram's max_zpage_size", v3. ZRAM's max_zpage_size is a bad thing. It forces zsmalloc to store normal objects as huge ones, which results in bigger zsmalloc memory usage. Drop it and use actual zsmalloc huge-class value when decide if the object is huge or not. This patch (of 2): Not every object can be share its zspage with other objects, e.g. when the object is as big as zspage or nearly as big a zspage. For such objects zsmalloc has a so called huge class - every object which belongs to huge class consumes the entire zspage (which consists of a physical page). On x86_64, PAGE_SHIFT 12 box, the first non-huge class size is 3264, so starting down from size 3264, objects can share page(-s) and thus minimize memory wastage. ZRAM, however, has its own statically defined watermark for huge objects, namely "3 * PAGE_SIZE / 4 = 3072", and forcibly stores every object larger than this watermark (3072) as a PAGE_SIZE object, in other words, to a huge class, while zsmalloc can keep some of those objects in non-huge classes. This results in increased memory consumption. zsmalloc knows better if the object is huge or not. Introduce zs_huge_class_size() function which tells if the given object can be stored in one of non-huge classes or not. This will let us to drop ZRAM's huge object watermark and fully rely on zsmalloc when we decide if the object is huge. [[email protected]: add pool param to zs_huge_class_size()] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-05mm: fix races between swapoff and flush dcacheHuang Ying19-26/+38
Thanks to commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks"), after swapoff the address_space associated with the swap device will be freed. So page_mapping() users which may touch the address_space need some kind of mechanism to prevent the address_space from being freed while it is accessed.

The dcache flushing functions (flush_dcache_page(), etc) in architecture specific code may access the address_space of the swap device for anonymous pages in swap cache via the page_mapping() function. But in some cases there are no mechanisms to prevent the swap device from being swapped off, for example:

    CPU1                                CPU2
    __get_user_pages()                  swapoff()
      flush_dcache_page()
        mapping = page_mapping()
          ...                             exit_swap_address_space()
          ...                               kvfree(spaces)
          mapping_mapped(mapping)

The address space may be accessed after being freed.

But according to cachetlb.txt and Russell King, flush_dcache_page() only cares about file cache pages; for anonymous pages, flush_anon_page() should be used. The implementation of flush_dcache_page() in all architectures follows this too. They check whether page_mapping() is NULL and whether mapping_mapped() is true to determine whether to flush the dcache immediately, and they use an interval tree (mapping->i_mmap) to find all user space mappings. mapping_mapped() and mapping->i_mmap aren't used for anonymous pages in swap cache at all.

So, to fix the race between swapoff and dcache flushing, a helper is added that returns the address_space for file cache pages and NULL otherwise. All page_mapping() calls in the dcache flushing functions are replaced with page_mapping_file().

[[email protected]: simplify page_mapping_file(), per Mike]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Chen Liqin <[email protected]>
Cc: Russell King <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Guan Xuetao <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Ley Foon Tan <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
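A sketch of the helper implied by that description (treat it as illustrative rather than the exact merged code):

    /* Sketch: return the address_space only for file cache pages */
    static inline struct address_space *page_mapping_file(struct page *page)
    {
            if (unlikely(PageSwapCache(page)))
                    return NULL;    /* anon page in swap cache: mapping may vanish at swapoff */
            return page_mapping(page);
    }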
2018-04-05fs/direct-io.c: minor cleanups in do_blockdev_direct_IONikolay Borisov1-5/+4
We already get the block counts and calculate the end block at the beginning of the function. Let's use the local variables for consistency and readability. No functional changes [[email protected]: constify the locals to prevent future slipups] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: Jeff Moyer <[email protected]> Cc: Al Viro <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-05include/linux/mm.h: provide consistent declaration for num_poisoned_pagesGuenter Roeck1-1/+1
clang reports the following compile warning:

    In file included from mm/vmscan.c:56:
    ./include/linux/swapops.h:327:22: warning: section attribute is specified on
        redeclared variable [-Wsection]
    extern atomic_long_t num_poisoned_pages __read_mostly;
                         ^
    ./include/linux/mm.h:2585:22: note: previous declaration is here
    extern atomic_long_t num_poisoned_pages;
                         ^

Let's use __read_mostly everywhere.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Guenter Roeck <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Matthias Kaehlcke <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2018-04-05device-dax: implement ->pagesize() for smaps to report MMUPageSizeDan Williams1-0/+10
Given that device-dax is making similar page mapping size guarantees as hugetlbfs, emit the size in smaps and any other kernel path that requests the mapping size of a vma. Link: http://lkml.kernel.org/r/151996255287.27922.18397777516059080245.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <[email protected]> Reported-by: Jane Chu <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-05mm, hugetlbfs: introduce ->pagesize() to vm_operations_structDan Williams2-8/+12
When device-dax is operating in huge-page mode we want it to behave like hugetlbfs and report the MMU page mapping size that is being enforced by the vma. Similar to commit 31383c6865a5 "mm, hugetlbfs: introduce ->split() to vm_operations_struct" it would be messy to teach vma_mmu_pagesize() about device-dax page mapping sizes in the same (hstate) way that hugetlbfs communicates this attribute. Instead, these patches introduce a new ->pagesize() vm operation. Link: http://lkml.kernel.org/r/151996254734.27922.15813097401404359642.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <[email protected]> Reported-by: Jane Chu <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
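As a sketch, the new hook and a hugetlbfs-style implementation could look like this (the names below are assumptions for illustration, not the merged code):

    struct vm_operations_struct {
            /* ... existing hooks ... */
            /* new: report the MMU page mapping size enforced by this vma */
            unsigned long (*pagesize)(struct vm_area_struct *area);
    };

    /* e.g. a hugetlbfs-style implementation */
    static unsigned long hugetlb_vm_op_pagesize(struct vm_area_struct *vma)
    {
            return huge_page_size(hstate_vma(vma));
    }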
2018-04-05mm, powerpc: use vma_kernel_pagesize() in vma_mmu_pagesize()Dan Williams3-15/+4
Patch series "mm, smaps: MMUPageSize for device-dax", v3. Similar to commit 31383c6865a5 ("mm, hugetlbfs: introduce ->split() to vm_operations_struct") here is another occasion where we want special-case hugetlbfs/hstate enabling to also apply to device-dax. This prompts the question what other hstate conversions we might do beyond ->split() and ->pagesize(), but this appears to be the last of the usages of hstate_vma() in generic/non-hugetlbfs specific code paths. This patch (of 3): The current powerpc definition of vma_mmu_pagesize() open codes looking up the page size via hstate. It is identical to the generic vma_kernel_pagesize() implementation. Now, vma_kernel_pagesize() is growing support for determining the page size of Device-DAX vmas in addition to the existing Hugetlbfs page size determination. Ideally, if the powerpc vma_mmu_pagesize() used vma_kernel_pagesize() it would automatically benefit from any new vma-type support that is added to vma_kernel_pagesize(). However, the powerpc vma_mmu_pagesize() is prevented from calling vma_kernel_pagesize() due to a circular header dependency that requires vma_mmu_pagesize() to be defined before including <linux/hugetlb.h>. Break this circular dependency by defining the default vma_mmu_pagesize() as a __weak symbol to be overridden by the powerpc version. Link: http://lkml.kernel.org/r/151996254179.27922.2213728278535578744.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Jane Chu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-05mm/gup.c: fix coding style issues.Mario Leinweber1-2/+2
- Fixed style error: 8 spaces -> 1 tab. - Fixed style warning: Corrected misleading indentation. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mario Leinweber <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-05mm/free_pcppages_bulk: prefetch buddy while not holding lockAaron Lu1-0/+22
When a page is freed back to the global pool, its buddy will be checked to see if it's possible to do a merge. This requires accessing buddy's page structure and that access could take a long time if it's cache cold. This patch adds a prefetch to the to-be-freed page's buddy outside of zone->lock in hope of accessing buddy's page structure later under zone->lock will be faster. Since we *always* do buddy merging and check an order-0 page's buddy to try to merge it when it goes into the main allocator, the cacheline will always come in, i.e. the prefetched data will never be unused. Normally, the number of prefetch will be pcp->batch(default=31 and has an upper limit of (PAGE_SHIFT * 8)=96 on x86_64) but in the case of pcp's pages get all drained, it will be pcp->count which has an upper limit of pcp->high. pcp->high, although has a default value of 186 (pcp->batch=31 * 6), can be changed by user through /proc/sys/vm/percpu_pagelist_fraction and there is no software upper limit so could be large, like several thousand. For this reason, only the first pcp->batch number of page's buddy structure is prefetched to avoid excessive prefetching. In the meantime, there are two concerns: 1. the prefetch could potentially evict existing cachelines, especially for L1D cache since it is not huge 2. there is some additional instruction overhead, namely calculating buddy pfn twice For 1, it's hard to say, this microbenchmark though shows good result but the actual benefit of this patch will be workload/CPU dependant; For 2, since the calculation is a XOR on two local variables, it's expected in many cases that cycles spent will be offset by reduced memory latency later. This is especially true for NUMA machines where multiple CPUs are contending on zone->lock and the most time consuming part under zone->lock is the wait of 'struct page' cacheline of the to-be-freed pages and their buddies. Test with will-it-scale/page_fault1 full load: kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S) v4.16-rc2+ 9034215 7971818 13667135 15677465 patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4% this patch 10180856 +6.8% 8506369 +2.3% 14756865 +4.9% 17325324 +3.9% Note: this patch's performance improvement percent is against patch2/3. (Changelog stolen from Dave Hansen and Mel Gorman's comments at http://lkml.kernel.org/r/[email protected]) [[email protected]: use helper function, avoid disordering pages] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: v4] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Aaron Lu <[email protected]> Suggested-by: Ying Huang <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Kemi Wang <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Tim Chen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
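A sketch of the prefetch idea (the helper and its call site are illustrative; for order-0 the buddy pfn is just pfn ^ 1):

    /* Sketch: warm the buddy's struct page cacheline before zone->lock is taken */
    static inline void prefetch_buddy(struct page *page)
    {
            unsigned long pfn = page_to_pfn(page);
            unsigned long buddy_pfn = pfn ^ 1;      /* order-0 buddy */
            struct page *buddy = page + (buddy_pfn - pfn);

            prefetch(buddy);
    }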
2018-04-05mm/free_pcppages_bulk: do not hold lock when picking pages to freeAaron Lu1-16/+23
When freeing a batch of pages from Per-CPU-Pages (PCP) back to buddy, the zone->lock is held and then pages are chosen from the PCP's migratetype list. There is actually no need to do this 'choose part' under the lock: these are PCP pages, the only CPU that can touch them is us, and irqs are disabled as well. Moving this part outside the lock reduces lock hold time and improves performance.

Test with will-it-scale/page_fault1 full load:

    kernel        Broadwell(2S)    Skylake(2S)     Broadwell(4S)    Skylake(4S)
    v4.16-rc2+    9034215          7971818         13667135         15677465
    this patch    9536374  +5.6%   8314710  +4.3%  14070408  +3.0%  16675866  +6.4%

What the test does is: it starts $nr_cpu processes and each repeatedly does the following for 5 minutes:
 - mmap 128M of anonymous space
 - write access to that space
 - munmap

The score is the aggregated number of iterations.

https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Aaron Lu <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Kemi Wang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2018-04-05mm/free_pcppages_bulk: update pcp->count insideAaron Lu1-7/+3
Matthew Wilcox found that all callers of free_pcppages_bulk() currently update pcp->count immediately after so it's natural to do it inside free_pcppages_bulk(). No functionality or performance change is expected from this patch. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Aaron Lu <[email protected]> Suggested-by: Matthew Wilcox <[email protected]> Acked-by: David Rientjes <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Huang Ying <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Kemi Wang <[email protected]> Cc: Tim Chen <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-05mm, compaction: drain pcps for zone when kcompactd failsDavid Rientjes1-0/+8
It's possible for free pages to become stranded on per-cpu pagesets (pcps) that, if drained, could be merged with buddy pages on the zone's free area to form large order pages, including up to MAX_ORDER. Consider a verbose example using the tools/vm/page-types tool at the beginning of a ZONE_NORMAL ('B' indicates a buddy page and 'S' indicates a slab page). Pages on pcps do not have any page flags set. 109954 1 _______S________________________________________________________ 109955 2 __________B_____________________________________________________ 109957 1 ________________________________________________________________ 109958 1 __________B_____________________________________________________ 109959 7 ________________________________________________________________ 109960 1 __________B_____________________________________________________ 109961 9 ________________________________________________________________ 10996a 1 __________B_____________________________________________________ 10996b 3 ________________________________________________________________ 10996e 1 __________B_____________________________________________________ 10996f 1 ________________________________________________________________ ... 109f8c 1 __________B_____________________________________________________ 109f8d 2 ________________________________________________________________ 109f8f 2 __________B_____________________________________________________ 109f91 f ________________________________________________________________ 109fa0 1 __________B_____________________________________________________ 109fa1 7 ________________________________________________________________ 109fa8 1 __________B_____________________________________________________ 109fa9 1 ________________________________________________________________ 109faa 1 __________B_____________________________________________________ 109fab 1 _______S________________________________________________________ The compaction migration scanner is attempting to defragment this memory since it is at the beginning of the zone. It has done so quite well, all movable pages have been migrated. From pfn [0x109955, 0x109fab), there are only buddy pages and pages without flags set. These pages may be stranded on pcps that could otherwise allow this memory to be coalesced if freed back to the zone free area. It is possible that some of these pages may not be on pcps and that something has called alloc_pages() and used the memory directly, but we rely on the absence of __GFP_MOVABLE in these cases to allocate from MIGATE_UNMOVABLE pageblocks to try to keep these MIGRATE_MOVABLE pageblocks as free as possible. These buddy and pcp pages, spanning 1,621 pages, could be coalesced and allow for three transparent hugepages to be dynamically allocated. Running the numbers for all such spans on the system, it was found that there were over 400 such spans of only buddy pages and pages without flags set at the time this /proc/kpageflags sample was collected. Without this support, there were _no_ order-9 or order-10 pages free. When kcompactd fails to defragment memory such that a cc.order page can be allocated, drain all pcps for the zone back to the buddy allocator so this stranding cannot occur. Compaction for that order will subsequently be deferred, which acts as a ratelimit on this drain. 
Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: David Rientjes <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
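As a sketch of the drain described in the commit above (the failure condition is a stand-in; drain_all_pages() is the existing interface for flushing per-cpu pagesets):

    /* Sketch: kcompactd could not produce a cc.order page for this zone */
    if (compaction_failed_for_order)        /* hypothetical flag */
            drain_all_pages(zone);          /* flush pcps back to the buddy lists */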
2018-04-05mm: make should_failslab always available for fault injectionHoward McLauchlan3-3/+12
should_failslab() is a convenient function to hook into for directed error injection into kmalloc(). However, it is only available if a config flag is set.

The following BCC script, for example, fails kmalloc() calls after a btrfs umount:

    from bcc import BPF

    prog = r"""
    BPF_HASH(flag);

    #include <linux/mm.h>

    int kprobe__btrfs_close_devices(void *ctx)
    {
            u64 key = 1;
            flag.update(&key, &key);
            return 0;
    }

    int kprobe__should_failslab(struct pt_regs *ctx)
    {
            u64 key = 1;
            u64 *res;

            res = flag.lookup(&key);
            if (res != 0) {
                    bpf_override_return(ctx, -ENOMEM);
            }
            return 0;
    }
    """

    b = BPF(text=prog)
    while 1:
        b.kprobe_poll()

This patch refactors the should_failslab implementation so that the function is always available for error injection, independent of flags.

This change would be similar in nature to commit f5490d3ec921 ("block: Add should_fail_bio() for bpf error injection").

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Howard McLauchlan <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Akinobu Mita <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2018-04-05mm/page_poison.c: make early_page_poison_param() __initDou Liyang1-1/+1
early_param() handlers are only called during kernel initialization, so Linux marks them with the __init macro to save memory. But early_page_poison_param() was not marked. Make it __init as well.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Dou Liyang <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Philippe Ombredanne <[email protected]>
Cc: Kate Stewart <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
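For reference, the pattern being annotated looks roughly like this sketch:

    /* Sketch: early_param() handlers run only at boot, so they can live in .init.text */
    static int __init early_page_poison_param(char *buf)
    {
            /* parse "page_poison=on"/"off" ... */
            return 0;
    }
    early_param("page_poison", early_page_poison_param);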
2018-04-05mm/page_owner.c: make early_page_owner_param() __initDou Liyang1-1/+1
early_param() handlers are only called during kernel initialization, so Linux marks them with the __init macro to save memory. But early_page_owner_param() was not marked. Make it __init as well.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Dou Liyang <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2018-04-05mm/kmemleak.c: make kmemleak_boot_config() __initDou Liyang1-1/+1
early_param() handlers are only called during kernel initialization, so Linux marks them with the __init macro to save memory. But kmemleak_boot_config() was not marked. Make it __init as well.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Dou Liyang <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Catalin Marinas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2018-04-05mm: swap: unify cluster-based and vma-based swap readaheadMinchan Kim4-38/+53
This patch makes do_swap_page() not need to be aware of two different swap readahead algorithms. Just unify cluster-based and vma-based readahead function call. Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Minchan Kim <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Huang Ying <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-04-05mm: swap: clean up swap readaheadMinchan Kim3-77/+62
Looking at the recent changes to swap readahead, I was unhappy about the current code structure, which diverges into two swap readahead algorithms in do_swap_page(). This patch cleans that up. The main motivation is that the fault handler doesn't need to be aware of the readahead algorithms; it should just call swapin_readahead().

As a first step, this patch only cleans things up partially (I split it out to make review easier); the next patch completes the goal.

[[email protected]: do not check readahead flag with THP anon]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Minchan Kim <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Huang Ying <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2018-04-05mm,vmscan: don't pretend forward progress upon shrinker_rwsem contentionTetsuo Handa1-9/+1
Since we no longer use the return value of shrink_slab() for normal reclaim, the comment is no longer true. If some do_shrink_slab() call takes unexpectedly long (the root cause of the stall is currently unknown) while register_shrinker()/unregister_shrinker() is pending, trying to drop caches via /proc/sys/vm/drop_caches could become an infinite cond_resched() loop if many mem_cgroups are defined. For safety, let's not pretend forward progress.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Tetsuo Handa <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Glauber Costa <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2018-04-05z3fold: limit use of stale list for allocationVitaly Wool1-16/+19
Currently if z3fold couldn't find an unbuddied page it would first try to pull a page off the stale list. The problem with this approach is that we can't 100% guarantee that the page is not processed by the workqueue thread at the same time unless we run cancel_work_sync() on it, which we can't do if we're in an atomic context. So let's just limit stale list usage to non-atomic contexts only. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vitaly Vul <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>