Age    Commit message    Author    Files    Lines
2015-02-11tg3: Hold tp->lock before calling tg3_halt() from tg3_init_one()Jun'ichi Nomura \(NEC\)1-0/+2
tg3_init_one() calls tg3_halt() without tp->lock despite its assumption and causes deadlock. If lockdep is enabled, a warning like this shows up before the stall:

  [ BUG: bad unlock balance detected! ]
  3.19.0test #3 Tainted: G E
  -------------------------------------
  insmod/369 is trying to release lock (&(&tp->lock)->rlock) at:
  [<ffffffffa02d5a1d>] tg3_chip_reset+0x14d/0x780 [tg3]
  but there are no more locks to release!

tg3_init_one() doesn't call tg3_halt() under normal situation but during kexec kdump I hit this problem.

Fixes: 932f19de ("tg3: Release tp->lock before invoking synchronize_irq()")
Signed-off-by: Jun'ichi Nomura <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
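A minimal sketch of the kind of change described, assuming tg3's usual tg3_full_lock()/tg3_full_unlock() helpers and halt arguments (both are assumptions here, not quoted from the patch):

    /* in the tg3_init_one() error path -- hedged sketch, not the verbatim fix */
    tg3_full_lock(tp, 0);                  /* take tp->lock, as tg3_halt() assumes */
    tg3_halt(tp, RESET_KIND_SHUTDOWN, 1);
    tg3_full_unlock(tp);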
2015-02-11Merge tag 'wireless-drivers-for-davem-2015-02-11' of ↵David S. Miller1-4/+1
git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers rtlwifi: * remove superfluous warning message which is not needed anymore Signed-off-by: David S. Miller <[email protected]>
2015-02-11bgmac: fix device initialization on Northstar SoCs (condition typo)Rafał Miłecki1-2/+3
On Northstar (Broadcom's ARM architecture) we need to manually enable all cores. Code for that is already in place, but the condition for it was wrong. Signed-off-by: Rafał Miłecki <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-02-11qlcnic: Delete existing multicast MAC list before adding newShahed Shaikh3-13/+51
The driver keeps adding multicast addresses without deleting removed MACs or worrying about the adapter's filter limit. As a result, the actual count of programmed multicast addresses accumulates over time and overruns the adapter's filter limit without putting the device in ACCEPT_ALL_MULTI mode. This causes newly added multicast traffic to fail after a certain pattern of additions and deletions. The issue is seen only when the netdev's mcast list count is less than the adapter's mcast filter limit.

For example, if the adapter's multicast filter limit is 38 per function, the following sequence results in multicast traffic failure for newly added MACs:

  - add less than 38 multicast MACs
  - remove previously added multicast MACs
  - add new multicast MACs (less than 38)

Signed-off-by: Shahed Shaikh <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2015-02-11net/mlx5_core: Fix configuration of log_uar_page_szEli Cohen1-0/+1
The current code failed to configure the page size for architectures with page size different than 4K - PPC for example. Signed-off-by: Carol L Soto <[email protected]> Signed-off-by: Eli Cohen <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-02-11sunvnet: don't change gso data on clonesDavid L Stevens1-13/+10
This patch unclones an skb for the case where the sunvnet driver needs to change the segmentation size so that it doesn't interfere with TCP SACK's use of them. Signed-off-by: David L Stevens <[email protected]> Acked-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-02-11drivers/net: Use setup_timer and mod_timerVaishali Thakkar1-5/+2
This patch introduces the use of the functions setup_timer and mod_timer. This is done using Coccinelle; the semantic patch used for it is as follows:

  // <smpl>
  @@ expression x,y,z,a,b; @@
  -init_timer (&x);
  +setup_timer (&x, y, z);
  +mod_timer (&a, b);
  -x.function = y;
  -x.data = z;
  -x.expires = b;
  -add_timer(&a);
  // </smpl>

Signed-off-by: Vaishali Thakkar <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
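For illustration, a hedged sketch of what this conversion looks like at a call site (the timer field, callback and timeout names are hypothetical, not taken from the patched driver):

    /* before: open-coded timer setup */
    init_timer(&priv->poll_timer);
    priv->poll_timer.function = my_poll_fn;            /* hypothetical callback */
    priv->poll_timer.data     = (unsigned long)priv;
    priv->poll_timer.expires  = jiffies + HZ;
    add_timer(&priv->poll_timer);

    /* after: equivalent using the helpers */
    setup_timer(&priv->poll_timer, my_poll_fn, (unsigned long)priv);
    mod_timer(&priv->poll_timer, jiffies + HZ);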
2015-02-11drivers: net: xgene: Make xgene_enet_of_match depend on CONFIG_OFGeert Uytterhoeven1-0/+2
If CONFIG_NET_XGENE=y but CONFIG_OF=n: drivers/net/ethernet/apm/xgene/xgene_enet_main.c:1033: warning: ‘xgene_enet_of_match’ defined but not used Signed-off-by: Geert Uytterhoeven <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-02-11et131x: use msecs_to_jiffies for conversionsNicholas Mc Guire1-2/+4
This is only an API consolidation and should make things more readable. Converting milliseconds to jiffies by "val * HZ / 1000" is technically OK but msecs_to_jiffies(val) is the cleaner solution and handles all corner cases correctly. This is a minor API cleanup only. Signed-off-by: Nicholas Mc Guire <[email protected]> Acked-by: Mark Einon <[email protected]> Signed-off-by: David S. Miller <[email protected]>
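A hedged before/after sketch of the conversion (the 50 ms value and the call site are illustrative, not from the driver):

    /* before: open-coded milliseconds-to-jiffies conversion */
    schedule_timeout_interruptible(50 * HZ / 1000);

    /* after: msecs_to_jiffies() handles rounding and unusual HZ values */
    schedule_timeout_interruptible(msecs_to_jiffies(50));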
2015-02-11Merge remote-tracking branch 'grant/devicetree/next' into for-nextRob Herring1-15/+0
2015-02-12md/raid10: fix conversion from RAID0 to RAID10NeilBrown1-3/+9
A RAID0 array (like a LINEAR array) does not have a concept of 'size' being the amount of each device that is in use. Rather, as much of each device as is available is used. So the 'size' is set to 0 and ignored. RAID10 does have this concept and needs it to be set correctly. So when we convert RAID0 to RAID10 we must determine the 'size' (that being the size of the first 'strip_zone' in the RAID0), and set it correctly. Reported-and-tested-by: Xiao Ni <[email protected]> Signed-off-by: NeilBrown <[email protected]>
2015-02-11Merge branch 'akpm' (patches from Andrew)Linus Torvalds116-1717/+2491
Merge second set of updates from Andrew Morton: "More of MM" * emailed patches from Andrew Morton <[email protected]>: (83 commits) mm/nommu.c: fix arithmetic overflow in __vm_enough_memory() mm/mmap.c: fix arithmetic overflow in __vm_enough_memory() vmstat: Reduce time interval to stat update on idle cpu mm/page_owner.c: remove unnecessary stack_trace field Documentation/filesystems/proc.txt: describe /proc/<pid>/map_files mm: incorporate read-only pages into transparent huge pages vmstat: do not use deferrable delayed work for vmstat_update mm: more aggressive page stealing for UNMOVABLE allocations mm: always steal split buddies in fallback allocations mm: when stealing freepages, also take pages created by splitting buddy page mincore: apply page table walker on do_mincore() mm: /proc/pid/clear_refs: avoid split_huge_page() mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP) mempolicy: apply page table walker on queue_pages_range() arch/powerpc/mm/subpage-prot.c: use walk->vma and walk_page_vma() memcg: cleanup preparation for page table walk numa_maps: remove numa_maps->vma numa_maps: fix typo in gather_hugetbl_stats pagemap: use walk->vma instead of calling find_vma() clear_refs: remove clear_refs_private->vma and introduce clear_refs_test_walk() ...
2015-02-11Merge tag 'powerpc-3.20-1' of ↵Linus Torvalds237-3143/+5084
git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux Pull powerpc updates from Michael Ellerman: - Update of all defconfigs - Addition of a bunch of config options to modernise our defconfigs - Some PS3 updates from Geoff - Optimised memcmp for 64 bit from Anton - Fix for kprobes that allows 'perf probe' to work from Naveen - Several cxl updates from Ian & Ryan - Expanded support for the '24x7' PMU from Cody & Sukadev - Freescale updates from Scott: "Highlights include 8xx optimizations, some more work on datapath device tree content, e300 machine check support, t1040 corenet error reporting, and various cleanups and fixes" * tag 'powerpc-3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (102 commits) cxl: Add missing return statement after handling AFU errror cxl: Fail AFU initialisation if an invalid configuration record is found cxl: Export optional AFU configuration record in sysfs powerpc/mm: Warn on flushing tlb page in kernel context powerpc/powernv: Add OPAL soft-poweroff routine powerpc/perf/hv-24x7: Document sysfs event description entries powerpc/perf/hv-gpci: add the remaining gpci requests powerpc/perf/{hv-gpci, hv-common}: generate requests with counters annotated powerpc/perf/hv-24x7: parse catalog and populate sysfs with events perf: define EVENT_DEFINE_RANGE_FORMAT_LITE helper perf: add PMU_EVENT_ATTR_STRING() helper perf: provide sysfs_show for struct perf_pmu_events_attr powerpc/kernel: Avoid initializing device-tree pointer twice powerpc: Remove old compile time disabled syscall tracing code powerpc/kernel: Make syscall_exit a local label cxl: Fix device_node reference counting powerpc/mm: bail out early when flushing TLB page powerpc: defconfigs: add MTD_SPI_NOR (new dependency for M25P80) perf/powerpc: reset event hw state when adding it to the PMU powerpc/qe: Use strlcpy() ...
2015-02-11Merge tag 'arm64-upstream' of ↵Linus Torvalds73-919/+1533
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Catalin Marinas: "arm64 updates for 3.20: - reimplementation of the virtual remapping of UEFI Runtime Services in a way that is stable across kexec - emulation of the "setend" instruction for 32-bit tasks (user endianness switching trapped in the kernel, SCTLR_EL1.E0E bit set accordingly) - compat_sys_call_table implemented in C (from asm) and made it a constant array together with sys_call_table - export CPU cache information via /sys (like other architectures) - DMA API implementation clean-up in preparation for IOMMU support - macros clean-up for KVM - dropped some unnecessary cache+tlb maintenance - CONFIG_ARM64_CPU_SUSPEND clean-up - defconfig update (CPU_IDLE) The EFI changes going via the arm64 tree have been acked by Matt Fleming. There is also a patch adding sys_*stat64 prototypes to include/linux/syscalls.h, acked by Andrew Morton" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (47 commits) arm64: compat: Remove incorrect comment in compat_siginfo arm64: Fix section mismatch on alloc_init_p[mu]d() arm64: Avoid breakage caused by .altmacro in fpsimd save/restore macros arm64: mm: use *_sect to check for section maps arm64: drop unnecessary cache+tlb maintenance arm64:mm: free the useless initial page table arm64: Enable CPU_IDLE in defconfig arm64: kernel: remove ARM64_CPU_SUSPEND config option arm64: make sys_call_table const arm64: Remove asm/syscalls.h arm64: Implement the compat_sys_call_table in C syscalls: Declare sys_*stat64 prototypes if __ARCH_WANT_(COMPAT_)STAT64 compat: Declare compat_sys_sigpending and compat_sys_sigprocmask prototypes arm64: uapi: expose our struct ucontext to the uapi headers smp, ARM64: Kill SMP single function call interrupt arm64: Emulate SETEND for AArch32 tasks arm64: Consolidate hotplug notifier for instruction emulation arm64: Track system support for mixed endian EL0 arm64: implement generic IOMMU configuration arm64: Combine coherent and non-coherent swiotlb dma_ops ...
2015-02-11Merge branch 'for-linus' of ↵Linus Torvalds82-1030/+1640
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Martin Schwidefsky: - The remaining patches for the z13 machine support: kernel build option for z13, the cache synonym avoidance, SMT support, compare-and-delay for spinloops and the CES5S crypto adapater. - The ftrace support for function tracing with the gcc hotpatch option. This touches common code Makefiles, Steven is ok with the changes. - The hypfs file system gets an extension to access diagnose 0x0c data in user space for performance analysis for Linux running under z/VM. - The iucv hvc console gets wildcard spport for the user id filtering. - The cacheinfo code is converted to use the generic infrastructure. - Cleanup and bug fixes. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (42 commits) s390/process: free vx save area when releasing tasks s390/hypfs: Eliminate hypfs interval s390/hypfs: Add diagnose 0c support s390/cacheinfo: don't use smp_processor_id() in preemptible context s390/zcrypt: fixed domain scanning problem (again) s390/smp: increase maximum value of NR_CPUS to 512 s390/jump label: use different nop instruction s390/jump label: add sanity checks s390/mm: correct missing space when reporting user process faults s390/dasd: cleanup profiling s390/dasd: add locking for global_profile access s390/ftrace: hotpatch support for function tracing ftrace: let notrace function attribute disable hotpatching if necessary ftrace: allow architectures to specify ftrace compile options s390: reintroduce diag 44 calls for cpu_relax() s390/zcrypt: Add support for new crypto express (CEX5S) adapter. s390/zcrypt: Number of supported ap domains is not retrievable. s390/spinlock: add compare-and-delay to lock wait loops s390/tape: remove redundant if statement s390/hvc_iucv: add simple wildcard matches to the iucv allow filter ...
2015-02-11Merge tag 'please-pull-pstore' of ↵Linus Torvalds9-25/+193
git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux Pull pstore update from Tony Luck: "Miscellaneous fs/pstore fixes" * tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux: pstore: Fix sprintf format specifier in pstore_dump() pstore: Add pmsg - user-space accessible pstore object pstore: Handle zero-sized prz in series pstore: Remove superfluous memory size check pstore: Use scnprintf() in pstore_mkfile()
2015-02-11Merge tag 'nfs-for-3.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds60-1729/+5236
Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Features:
  - Removing the forced serialisation of open()/close() calls in NFSv4.x (x>0) makes for a significant performance improvement in metadata intensive workloads.
  - Full support for the pNFS "flexible files" layout type
  - Further RPC/RDMA client improvements from Chuck

  Bugfixes:
  - Stable fix: NFSv4.1 backchannel calls blocking operations with !TASK_RUNNING
  - Stable fix: pnfs_generic_pg_init_read/write can be called with lseg == NULL
  - Stable fix: Fix an Oopsable condition when nsm_mon_unmon is called as part of the namespace cleanup
  - Stable fix: Ensure we reference the inode for return-on-close in delegreturn
  - Use SO_REUSEPORT to ensure that NFSv3 TCP connections can rebind to the same source address/port combination during a disconnect/reconnect event. This is a requirement imposed by most NFSv3 server duplicate reply cache implementations.

  Optimisations:
  - Ask for no NFSv4.1 delegations on OPEN if using O_DIRECT

  Other:
  - Add Anna Schumaker as co-maintainer for the NFS client"

* tag 'nfs-for-3.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (119 commits)
  SUNRPC: Cleanup to remove xs_tcp_close()
  pnfs: delete an unintended goto
  pnfs/flexfiles: Do not dprintk after the free
  SUNRPC: Fix stupid typo in xs_sock_set_reuseport
  SUNRPC: Define xs_tcp_fin_timeout only if CONFIG_SUNRPC_DEBUG
  SUNRPC: Handle connection reset more efficiently.
  SUNRPC: Remove the redundant XPRT_CONNECTION_CLOSE flag
  SUNRPC: Make xs_tcp_close() do a socket shutdown rather than a sock_release
  SUNRPC: Ensure xs_tcp_shutdown() requests a full close of the connection
  SUNRPC: Cleanup to remove remaining uses of XPRT_CONNECTION_ABORT
  SUNRPC: Remove TCP socket linger code
  SUNRPC: Remove TCP client connection reset hack
  SUNRPC: TCP/UDP always close the old socket before reconnecting
  SUNRPC: Add helpers to prevent socket create from racing
  SUNRPC: Ensure xs_reset_transport() resets the close connection flags
  SUNRPC: Do not clear the source port in xs_reset_transport
  SUNRPC: Handle EADDRINUSE on connect
  SUNRPC: Set SO_REUSEPORT socket option for TCP connections
  NFSv4.1: Fix pnfs_put_lseg races
  NFSv4.1: pnfs_send_layoutreturn should use GFP_NOFS
  ...
2015-02-12Merge branch 'cpuidle' of ↵Rafael J. Wysocki1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux into pm-cpuidle Pull intel_idle update for v3.20 from Len Brown. * 'cpuidle' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux: intel_idle: support additional Broadwell model
2015-02-12Merge branch 'turbostat' of ↵Rafael J. Wysocki2-158/+254
git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux into pm-tools Pull additional turbostat updates for v3.20 from Len Brown. * 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux: tools/power turbostat: support additional Broadwell model tools/power turbostat: update parameters, documentation tools/power turbostat: Skip printing disabled package C-states
2015-02-12PM / devfreq: event: testing the wrong variableDan Carpenter1-2/+2
There is a typo here so we test "edev" but we intended to test "edev[i]". Fixes: f262f28c1470 ('PM / devfreq: event: Add devfreq_event class') Signed-off-by: Dan Carpenter <[email protected]> Reviewed-by: Chanwoo Choi <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2015-02-11mm/nommu.c: fix arithmetic overflow in __vm_enough_memory()Roman Gushchin1-2/+2
I noticed that "allowed" can easily overflow by falling below 0, because (total_vm / 32) can be larger than "allowed". The problem occurs in OVERCOMMIT_NONE mode. In this case a huge allocation can succeed and overcommit the system (despite OVERCOMMIT_NONE mode). All subsequent allocations will then fail (system-wide), so the system becomes unusable.

The problem was masked out by commit c9b1d0981fcc ("mm: limit growth of 3% hardcoded other user reserve"), but it's easy to reproduce on older kernels:

  1) set the overcommit_memory sysctl to 2
  2) mmap() a large file multiple times (with the VM_SHARED flag)
  3) try to malloc() a large amount of memory

It can also be reproduced on newer kernels, but a misconfigured sysctl_user_reserve_kbytes is required.

Fix this issue by switching to signed arithmetic here.

Signed-off-by: Roman Gushchin <[email protected]>
Cc: Andrew Shewmaker <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2015-02-11mm/mmap.c: fix arithmetic overflow in __vm_enough_memory()Roman Gushchin1-2/+2
I noticed that "allowed" can easily overflow by falling below 0, because (total_vm / 32) can be larger than "allowed". The problem occurs in OVERCOMMIT_NONE mode. In this case a huge allocation can succeed and overcommit the system (despite OVERCOMMIT_NONE mode). All subsequent allocations will then fail (system-wide), so the system becomes unusable.

The problem was masked out by commit c9b1d0981fcc ("mm: limit growth of 3% hardcoded other user reserve"), but it's easy to reproduce on older kernels:

  1) set the overcommit_memory sysctl to 2
  2) mmap() a large file multiple times (with the VM_SHARED flag)
  3) try to malloc() a large amount of memory

It can also be reproduced on newer kernels, but a misconfigured sysctl_user_reserve_kbytes is required.

Fix this issue by switching to signed arithmetic here.

[[email protected]: use min_t]
Signed-off-by: Roman Gushchin <[email protected]>
Cc: Andrew Shewmaker <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
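A minimal sketch of the failure mode and the fix described in these two commits; the variable names and the surrounding accounting are simplified assumptions, not the exact __vm_enough_memory() code:

    /* hedged sketch -- simplified, not the exact __vm_enough_memory() code */
    unsigned long limit   = vm_commit_limit();   /* overall commit limit */
    unsigned long reserve = mm->total_vm / 32;   /* can be larger than 'limit' */

    /* before: with unsigned arithmetic this wraps around to a huge value,
     * so a later "pages <= allowed" check effectively always passes */
    unsigned long allowed_unsigned = limit - reserve;

    /* after: signed arithmetic lets the underflow be detected and handled
     * (the min_t() tweak noted above is omitted from this sketch) */
    long allowed = limit;
    allowed -= reserve;
    if (allowed < 0)
        allowed = 0;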
2015-02-11vmstat: Reduce time interval to stat update on idle cpuChristoph Lameter1-2/+2
It was noted that the vm stat shepherd runs every 2 seconds and that the vmstat update is then scheduled 2 seconds in the future. This yields an effective interval of double the intended time, which is not desired. Change the shepherd so that it does not delay the vmstat update on the other cpu. We still have to use schedule_delayed_work since we are using a delayed_work_struct, but we can set the delay to 0.

Signed-off-by: Christoph Lameter <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Vinayak Menon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
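A hedged one-line sketch of the change described, assuming the shepherd queues the per-cpu vmstat work with schedule_delayed_work_on() (the exact call site is an assumption):

    /* queue the remote cpu's vmstat update immediately instead of
     * adding another round of the sampling interval */
    schedule_delayed_work_on(cpu, &per_cpu(vmstat_work, cpu), 0);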
2015-02-11mm/page_owner.c: remove unnecessary stack_trace fieldSergei Rogachev2-13/+15
Page owner uses the page_ext structure to keep meta-information for every page in the system. The structure also contains a field of type 'struct stack_trace', which page owner uses during invocation of the function save_stack_trace. It is easy to notice that keeping a copy of this structure for every page in the system is very inefficient in terms of memory. The patch removes this unnecessary field of page_ext and forces page owner to use a stack_trace structure allocated on the stack.

[[email protected]: use struct initializers]
Signed-off-by: Sergei Rogachev <[email protected]>
Acked-by: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
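A hedged sketch of the on-stack pattern described (the depth constant, entry array and skip value are illustrative assumptions):

    /* build the trace on the stack instead of keeping a struct
     * stack_trace inside page_ext for every page */
    unsigned long entries[16];                 /* hypothetical depth */
    struct stack_trace trace = {
        .nr_entries  = 0,
        .entries     = entries,
        .max_entries = ARRAY_SIZE(entries),
        .skip        = 3,                      /* skip value is an assumption */
    };

    save_stack_trace(&trace);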
2015-02-11Documentation/filesystems/proc.txt: describe /proc/<pid>/map_filesCyrill Gorcunov1-0/+23
[[email protected]: tweaks] Signed-off-by: Cyrill Gorcunov <[email protected]> Cc: Kees Cook <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Calvin Owens <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-02-11mm: incorporate read-only pages into transparent huge pagesEbru Akagunduz1-13/+42
This patch aims to improve THP collapse rates, by allowing THP collapse in the presence of read-only ptes, like those left in place by do_swap_page after a read fault. Currently THP can collapse 4kB pages into a THP when there are up to khugepaged_max_ptes_none pte_none ptes in a 2MB range. This patch applies the same limit for read-only ptes. The patch was tested with a test program that allocates 800MB of memory, writes to it, and then sleeps. I force the system to swap out all but 190MB of the program by touching other memory. Afterwards, the test program does a mix of reads and writes to its memory, and the memory gets swapped back in. Without the patch, only the memory that did not get swapped out remained in THPs, which corresponds to 24% of the memory of the program. The percentage did not increase over time. With this patch, after 5 minutes of waiting khugepaged had collapsed 50% of the program's memory back into THPs. Test results: With the patch: After swapped out: cat /proc/pid/smaps: Anonymous: 100464 kB AnonHugePages: 100352 kB Swap: 699540 kB Fraction: 99,88 cat /proc/meminfo: AnonPages: 1754448 kB AnonHugePages: 1716224 kB Fraction: 97,82 After swapped in: In a few seconds: cat /proc/pid/smaps: Anonymous: 800004 kB AnonHugePages: 145408 kB Swap: 0 kB Fraction: 18,17 cat /proc/meminfo: AnonPages: 2455016 kB AnonHugePages: 1761280 kB Fraction: 71,74 In 5 minutes: cat /proc/pid/smaps Anonymous: 800004 kB AnonHugePages: 407552 kB Swap: 0 kB Fraction: 50,94 cat /proc/meminfo: AnonPages: 2456872 kB AnonHugePages: 2023424 kB Fraction: 82,35 Without the patch: After swapped out: cat /proc/pid/smaps: Anonymous: 190660 kB AnonHugePages: 190464 kB Swap: 609344 kB Fraction: 99,89 cat /proc/meminfo: AnonPages: 1740456 kB AnonHugePages: 1667072 kB Fraction: 95,78 After swapped in: cat /proc/pid/smaps: Anonymous: 800004 kB AnonHugePages: 190464 kB Swap: 0 kB Fraction: 23,80 cat /proc/meminfo: AnonPages: 2350032 kB AnonHugePages: 1667072 kB Fraction: 70,93 I waited 10 minutes the fractions did not change without the patch. Signed-off-by: Ebru Akagunduz <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Zhang Yanfei <[email protected]> Acked-by: Andrea Arcangeli <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mel Gorman <[email protected]> Cc: David Rientjes <[email protected]> Cc: Sasha Levin <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-02-11vmstat: do not use deferrable delayed work for vmstat_updateMichal Hocko1-1/+1
Vinayak Menon has reported that an excessive number of tasks was throttled in the direct reclaim inside too_many_isolated() because NR_ISOLATED_FILE was relatively high compared to NR_INACTIVE_FILE. However it turned out that the real number of NR_ISOLATED_FILE was 0 and the per-cpu vm_stat_diff wasn't transferred into the global counter. vmstat_work which is responsible for the sync is defined as deferrable delayed work which means that the defined timeout doesn't wake up an idle CPU. A CPU might stay in an idle state for a long time and general effort is to keep such a CPU in this state as long as possible which might lead to all sorts of troubles for vmstat consumers as can be seen with the excessive direct reclaim throttling. This patch basically reverts 39bf6270f524 ("VM statistics: Make timer deferrable") but it shouldn't cause any problems for idle CPUs because only CPUs with an active per-cpu drift are woken up since 7cc36bbddde5 ("vmstat: on-demand vmstat workers v8") and CPUs which are idle for a longer time shouldn't have per-cpu drift. Fixes: 39bf6270f524 (VM statistics: Make timer deferrable) Signed-off-by: Michal Hocko <[email protected]> Reported-by: Vinayak Menon <[email protected]> Acked-by: Christoph Lameter <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Minchan Kim <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
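A hedged sketch of the distinction this change relies on (the initialisation site is illustrative): a deferrable delayed work will not wake an idle CPU when its timer expires, while a regular delayed work will.

    /* before: deferrable work -- an idle cpu may never run the sync */
    INIT_DEFERRABLE_WORK(&vmstat_work, vmstat_update);

    /* after: ordinary delayed work -- the timer wakes the cpu so the
     * per-cpu counters get folded into the global counters */
    INIT_DELAYED_WORK(&vmstat_work, vmstat_update);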
2015-02-11mm: more aggressive page stealing for UNMOVABLE allocationsVlastimil Babka1-4/+14
When allocation falls back to stealing free pages of another migratetype, it can decide to steal extra pages, or even the whole pageblock in order to reduce fragmentation, which could happen if further allocation fallbacks pick a different pageblock. In try_to_steal_freepages(), one of the situations where extra pages are stolen happens when we are trying to allocate a MIGRATE_RECLAIMABLE page. However, MIGRATE_UNMOVABLE allocations are not treated the same way, although spreading such allocation over multiple fallback pageblocks is arguably even worse than it is for RECLAIMABLE allocations. To minimize fragmentation, we should minimize the number of such fallbacks, and thus steal as much as is possible from each fallback pageblock. Note that in theory this might put more pressure on movable pageblocks and cause movable allocations to steal back from unmovable pageblocks. However, movable allocations are not as aggressive with stealing, and do not cause permanent fragmentation, so the tradeoff is reasonable, and evaluation seems to support the change. This patch thus adds a check for MIGRATE_UNMOVABLE to the decision to steal extra free pages. When evaluating with stress-highalloc from mmtests, this has reduced the number of MIGRATE_UNMOVABLE fallbacks to roughly 1/6. The number of these fallbacks stealing from MIGRATE_MOVABLE block is reduced to 1/3. There was no observation of growing number of unmovable pageblocks over time, and also not of increased movable allocation fallbacks. Signed-off-by: Vlastimil Babka <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Zhang Yanfei <[email protected]> Cc: Minchan Kim <[email protected]> Cc: David Rientjes <[email protected]> Cc: Rik van Riel <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Michal Hocko <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
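Roughly the shape of the extra-stealing check after this patch, as described above; the surrounding helper and local naming are assumptions, not the literal diff:

    /* hedged sketch of the decision in try_to_steal_freepages() */
    if (current_order >= pageblock_order / 2 ||
        start_migratetype == MIGRATE_RECLAIMABLE ||
        start_migratetype == MIGRATE_UNMOVABLE ||   /* new: treat UNMOVABLE like RECLAIMABLE */
        page_group_by_mobility_disabled) {
        /* steal as many free pages from the fallback pageblock as possible */
    }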
2015-02-11mm: always steal split buddies in fallback allocationsVlastimil Babka1-33/+29
When allocation falls back to another migratetype, it will steal a page with highest available order, and (depending on this order and desired migratetype), it might also steal the rest of free pages from the same pageblock. Given the preference of highest available order, it is likely that it will be higher than the desired order, and result in the stolen buddy page being split. The remaining pages after split are currently stolen only when the rest of the free pages are stolen. This can however lead to situations where for MOVABLE allocations we split e.g. order-4 fallback UNMOVABLE page, but steal only order-0 page. Then on the next MOVABLE allocation (which may be batched to fill the pcplists) we split another order-3 or higher page, etc. By stealing all pages that we have split, we can avoid further stealing. This patch therefore adjusts the page stealing so that buddy pages created by split are always stolen. This has effect only on MOVABLE allocations, as RECLAIMABLE and UNMOVABLE allocations already always do that in addition to stealing the rest of free pages from the pageblock. The change also allows to simplify try_to_steal_freepages() and factor out CMA handling. According to Mel, it has been intended since the beginning that buddy pages after split would be stolen always, but it doesn't seem like it was ever the case until commit 47118af076f6 ("mm: mmzone: MIGRATE_CMA migration type added"). The commit has unintentionally introduced this behavior, but was reverted by commit 0cbef29a7821 ("mm: __rmqueue_fallback() should respect pageblock type"). Neither included evaluation. My evaluation with stress-highalloc from mmtests shows about 2.5x reduction of page stealing events for MOVABLE allocations, without affecting the page stealing events for other allocation migratetypes. Signed-off-by: Vlastimil Babka <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Zhang Yanfei <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: David Rientjes <[email protected]> Cc: Rik van Riel <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Michal Hocko <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-02-11mm: when stealing freepages, also take pages created by splitting buddy pageVlastimil Babka2-10/+9
When studying page stealing, I noticed some weird looking decisions in try_to_steal_freepages(). The first I assume is a bug (Patch 1), the following two patches were driven by evaluation. Testing was done with stress-highalloc of mmtests, using the mm_page_alloc_extfrag tracepoint and postprocessing to get counts of how often page stealing occurs for individual migratetypes, and what migratetypes are used for fallbacks. Arguably, the worst case of page stealing is when UNMOVABLE allocation steals from MOVABLE pageblock. RECLAIMABLE allocation stealing from MOVABLE allocation is also not ideal, so the goal is to minimize these two cases. The evaluation of v2 wasn't always clear win and Joonsoo questioned the results. Here I used different baseline which includes RFC compaction improvements from [1]. I found that the compaction improvements reduce variability of stress-highalloc, so there's less noise in the data. First, let's look at stress-highalloc configured to do sync compaction, and how these patches reduce page stealing events during the test. First column is after fresh reboot, other two are reiterations of test without reboot. That was all accumulater over 5 re-iterations (so the benchmark was run 5x3 times with 5 fresh restarts). Baseline: 3.19-rc4 3.19-rc4 3.19-rc4 5-nothp-1 5-nothp-2 5-nothp-3 Page alloc extfrag event 10264225 8702233 10244125 Extfrag fragmenting 10263271 8701552 10243473 Extfrag fragmenting for unmovable 13595 17616 15960 Extfrag fragmenting unmovable placed with movable 7989 12193 8447 Extfrag fragmenting for reclaimable 658 1840 1817 Extfrag fragmenting reclaimable placed with movable 558 1677 1679 Extfrag fragmenting for movable 10249018 8682096 10225696 With Patch 1: 3.19-rc4 3.19-rc4 3.19-rc4 6-nothp-1 6-nothp-2 6-nothp-3 Page alloc extfrag event 11834954 9877523 9774860 Extfrag fragmenting 11833993 9876880 9774245 Extfrag fragmenting for unmovable 7342 16129 11712 Extfrag fragmenting unmovable placed with movable 4191 10547 6270 Extfrag fragmenting for reclaimable 373 1130 923 Extfrag fragmenting reclaimable placed with movable 302 906 738 Extfrag fragmenting for movable 11826278 9859621 9761610 With Patch 2: 3.19-rc4 3.19-rc4 3.19-rc4 7-nothp-1 7-nothp-2 7-nothp-3 Page alloc extfrag event 4725990 3668793 3807436 Extfrag fragmenting 4725104 3668252 3806898 Extfrag fragmenting for unmovable 6678 7974 7281 Extfrag fragmenting unmovable placed with movable 2051 3829 4017 Extfrag fragmenting for reclaimable 429 1208 1278 Extfrag fragmenting reclaimable placed with movable 369 976 1034 Extfrag fragmenting for movable 4717997 3659070 3798339 With Patch 3: 3.19-rc4 3.19-rc4 3.19-rc4 8-nothp-1 8-nothp-2 8-nothp-3 Page alloc extfrag event 5016183 4700142 3850633 Extfrag fragmenting 5015325 4699613 3850072 Extfrag fragmenting for unmovable 1312 3154 3088 Extfrag fragmenting unmovable placed with movable 1115 2777 2714 Extfrag fragmenting for reclaimable 437 1193 1097 Extfrag fragmenting reclaimable placed with movable 330 969 879 Extfrag fragmenting for movable 5013576 4695266 3845887 In v2 we've seen apparent regression with Patch 1 for unmovable events, this is now gone, suggesting it was indeed noise. Here, each patch improves the situation for unmovable events. Reclaimable is improved by patch 1 and then either the same modulo noise, or perhaps sligtly worse - a small price for unmovable improvements, IMHO. The number of movable allocations falling back to other migratetypes is most noisy, but it's reduced to half at Patch 2 nevertheless. 
These are least critical as compaction can move them around. If we look at success rates, the patches don't affect them, that didn't change. Baseline: 3.19-rc4 3.19-rc4 3.19-rc4 5-nothp-1 5-nothp-2 5-nothp-3 Success 1 Min 49.00 ( 0.00%) 42.00 ( 14.29%) 41.00 ( 16.33%) Success 1 Mean 51.00 ( 0.00%) 45.00 ( 11.76%) 42.60 ( 16.47%) Success 1 Max 55.00 ( 0.00%) 51.00 ( 7.27%) 46.00 ( 16.36%) Success 2 Min 53.00 ( 0.00%) 47.00 ( 11.32%) 44.00 ( 16.98%) Success 2 Mean 59.60 ( 0.00%) 50.80 ( 14.77%) 48.20 ( 19.13%) Success 2 Max 64.00 ( 0.00%) 56.00 ( 12.50%) 52.00 ( 18.75%) Success 3 Min 84.00 ( 0.00%) 82.00 ( 2.38%) 78.00 ( 7.14%) Success 3 Mean 85.60 ( 0.00%) 82.80 ( 3.27%) 79.40 ( 7.24%) Success 3 Max 86.00 ( 0.00%) 83.00 ( 3.49%) 80.00 ( 6.98%) Patch 1: 3.19-rc4 3.19-rc4 3.19-rc4 6-nothp-1 6-nothp-2 6-nothp-3 Success 1 Min 49.00 ( 0.00%) 44.00 ( 10.20%) 44.00 ( 10.20%) Success 1 Mean 51.80 ( 0.00%) 46.00 ( 11.20%) 45.80 ( 11.58%) Success 1 Max 54.00 ( 0.00%) 49.00 ( 9.26%) 49.00 ( 9.26%) Success 2 Min 58.00 ( 0.00%) 49.00 ( 15.52%) 48.00 ( 17.24%) Success 2 Mean 60.40 ( 0.00%) 51.80 ( 14.24%) 50.80 ( 15.89%) Success 2 Max 63.00 ( 0.00%) 54.00 ( 14.29%) 55.00 ( 12.70%) Success 3 Min 84.00 ( 0.00%) 81.00 ( 3.57%) 79.00 ( 5.95%) Success 3 Mean 85.00 ( 0.00%) 81.60 ( 4.00%) 79.80 ( 6.12%) Success 3 Max 86.00 ( 0.00%) 82.00 ( 4.65%) 82.00 ( 4.65%) Patch 2: 3.19-rc4 3.19-rc4 3.19-rc4 7-nothp-1 7-nothp-2 7-nothp-3 Success 1 Min 50.00 ( 0.00%) 44.00 ( 12.00%) 39.00 ( 22.00%) Success 1 Mean 52.80 ( 0.00%) 45.60 ( 13.64%) 42.40 ( 19.70%) Success 1 Max 55.00 ( 0.00%) 46.00 ( 16.36%) 47.00 ( 14.55%) Success 2 Min 52.00 ( 0.00%) 48.00 ( 7.69%) 45.00 ( 13.46%) Success 2 Mean 53.40 ( 0.00%) 49.80 ( 6.74%) 48.80 ( 8.61%) Success 2 Max 57.00 ( 0.00%) 52.00 ( 8.77%) 52.00 ( 8.77%) Success 3 Min 84.00 ( 0.00%) 81.00 ( 3.57%) 79.00 ( 5.95%) Success 3 Mean 85.00 ( 0.00%) 82.40 ( 3.06%) 79.60 ( 6.35%) Success 3 Max 86.00 ( 0.00%) 83.00 ( 3.49%) 80.00 ( 6.98%) Patch 3: 3.19-rc4 3.19-rc4 3.19-rc4 8-nothp-1 8-nothp-2 8-nothp-3 Success 1 Min 46.00 ( 0.00%) 44.00 ( 4.35%) 42.00 ( 8.70%) Success 1 Mean 50.20 ( 0.00%) 45.60 ( 9.16%) 44.00 ( 12.35%) Success 1 Max 52.00 ( 0.00%) 47.00 ( 9.62%) 47.00 ( 9.62%) Success 2 Min 53.00 ( 0.00%) 49.00 ( 7.55%) 48.00 ( 9.43%) Success 2 Mean 55.80 ( 0.00%) 50.60 ( 9.32%) 49.00 ( 12.19%) Success 2 Max 59.00 ( 0.00%) 52.00 ( 11.86%) 51.00 ( 13.56%) Success 3 Min 84.00 ( 0.00%) 80.00 ( 4.76%) 79.00 ( 5.95%) Success 3 Mean 85.40 ( 0.00%) 81.60 ( 4.45%) 80.40 ( 5.85%) Success 3 Max 87.00 ( 0.00%) 83.00 ( 4.60%) 82.00 ( 5.75%) While there's no improvement here, I consider reduced fragmentation events to be worth on its own. 
Patch 2 also seems to reduce scanning for free pages, and migrations in compaction, suggesting it has somewhat less work to do: Patch 1: Compaction stalls 4153 3959 3978 Compaction success 1523 1441 1446 Compaction failures 2630 2517 2531 Page migrate success 4600827 4943120 5104348 Page migrate failure 19763 16656 17806 Compaction pages isolated 9597640 10305617 10653541 Compaction migrate scanned 77828948 86533283 87137064 Compaction free scanned 517758295 521312840 521462251 Compaction cost 5503 5932 6110 Patch 2: Compaction stalls 3800 3450 3518 Compaction success 1421 1316 1317 Compaction failures 2379 2134 2201 Page migrate success 4160421 4502708 4752148 Page migrate failure 19705 14340 14911 Compaction pages isolated 8731983 9382374 9910043 Compaction migrate scanned 98362797 96349194 98609686 Compaction free scanned 496512560 469502017 480442545 Compaction cost 5173 5526 5811 As with v2, /proc/pagetypeinfo appears unaffected with respect to numbers of unmovable and reclaimable pageblocks. Configuring the benchmark to allocate like THP page fault (i.e. no sync compaction) gives much noisier results for iterations 2 and 3 after reboot. This is not so surprising given how [1] offers lower improvements in this scenario due to less restarts after deferred compaction which would change compaction pivot. Baseline: 3.19-rc4 3.19-rc4 3.19-rc4 5-thp-1 5-thp-2 5-thp-3 Page alloc extfrag event 8148965 6227815 6646741 Extfrag fragmenting 8147872 6227130 6646117 Extfrag fragmenting for unmovable 10324 12942 15975 Extfrag fragmenting unmovable placed with movable 5972 8495 10907 Extfrag fragmenting for reclaimable 601 1707 2210 Extfrag fragmenting reclaimable placed with movable 520 1570 2000 Extfrag fragmenting for movable 8136947 6212481 6627932 Patch 1: 3.19-rc4 3.19-rc4 3.19-rc4 6-thp-1 6-thp-2 6-thp-3 Page alloc extfrag event 8345457 7574471 7020419 Extfrag fragmenting 8343546 7573777 7019718 Extfrag fragmenting for unmovable 10256 18535 30716 Extfrag fragmenting unmovable placed with movable 6893 11726 22181 Extfrag fragmenting for reclaimable 465 1208 1023 Extfrag fragmenting reclaimable placed with movable 353 996 843 Extfrag fragmenting for movable 8332825 7554034 6987979 Patch 2: 3.19-rc4 3.19-rc4 3.19-rc4 7-thp-1 7-thp-2 7-thp-3 Page alloc extfrag event 3512847 3020756 2891625 Extfrag fragmenting 3511940 3020185 2891059 Extfrag fragmenting for unmovable 9017 6892 6191 Extfrag fragmenting unmovable placed with movable 1524 3053 2435 Extfrag fragmenting for reclaimable 445 1081 1160 Extfrag fragmenting reclaimable placed with movable 375 918 986 Extfrag fragmenting for movable 3502478 3012212 2883708 Patch 3: 3.19-rc4 3.19-rc4 3.19-rc4 8-thp-1 8-thp-2 8-thp-3 Page alloc extfrag event 3181699 3082881 2674164 Extfrag fragmenting 3180812 3082303 2673611 Extfrag fragmenting for unmovable 1201 4031 4040 Extfrag fragmenting unmovable placed with movable 974 3611 3645 Extfrag fragmenting for reclaimable 478 1165 1294 Extfrag fragmenting reclaimable placed with movable 387 985 1030 Extfrag fragmenting for movable 3179133 3077107 2668277 The improvements for first iteration are clear, the rest is much noisier and can appear like regression for Patch 1. Anyway, patch 2 rectifies it. Allocation success rates are again unaffected so there's no point in making this e-mail any longer. 
[1] http://marc.info/?l=linux-mm&m=142166196321125&w=2 This patch (of 3): When __rmqueue_fallback() is called to allocate a page of order X, it will find a page of order Y >= X of a fallback migratetype, which is different from the desired migratetype. With the help of try_to_steal_freepages(), it may change the migratetype (to the desired one) also of: 1) all currently free pages in the pageblock containing the fallback page 2) the fallback pageblock itself 3) buddy pages created by splitting the fallback page (when Y > X) These decisions take the order Y into account, as well as the desired migratetype, with the goal of preventing multiple fallback allocations that could e.g. distribute UNMOVABLE allocations among multiple pageblocks. Originally, decision for 1) has implied the decision for 3). Commit 47118af076f6 ("mm: mmzone: MIGRATE_CMA migration type added") changed that (probably unintentionally) so that the buddy pages in case 3) are always changed to the desired migratetype, except for CMA pageblocks. Commit fef903efcf0c ("mm/page_allo.c: restructure free-page stealing code and fix a bug") did some refactoring and added a comment that the case of 3) is intended. Commit 0cbef29a7821 ("mm: __rmqueue_fallback() should respect pageblock type") removed the comment and tried to restore the original behavior where 1) implies 3), but due to the previous refactoring, the result is instead that only 2) implies 3) - and the conditions for 2) are less frequently met than conditions for 1). This may increase fragmentation in situations where the code decides to steal all free pages from the pageblock (case 1)), but then gives back the buddy pages produced by splitting. This patch restores the original intended logic where 1) implies 3). During testing with stress-highalloc from mmtests, this has shown to decrease the number of events where UNMOVABLE and RECLAIMABLE allocations steal from MOVABLE pageblocks, which can lead to permanent fragmentation. In some cases it has increased the number of events when MOVABLE allocations steal from UNMOVABLE or RECLAIMABLE pageblocks, but these are fixable by sync compaction and thus less harmful. Note that evaluation has shown that the behavior introduced by 47118af076f6 for buddy pages in case 3) is actually even better than the original logic, so the following patch will introduce it properly once again. For stable backports of this patch it makes thus sense to only fix versions containing 0cbef29a7821. [[email protected]: tracepoint fix] Signed-off-by: Vlastimil Babka <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Zhang Yanfei <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: David Rientjes <[email protected]> Cc: Rik van Riel <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Michal Hocko <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: <[email protected]> [3.13+ containing 0cbef29a7821] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-02-11mincore: apply page table walker on do_mincore()Naoya Horiguchi2-126/+60
This patch makes do_mincore() use walk_page_vma(), which reduces many lines of code by using common page table walk code. [[email protected]: remove unneeded variable 'err'] Signed-off-by: Naoya Horiguchi <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Signed-off-by: Daeseok Youn <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-02-11mm: /proc/pid/clear_refs: avoid split_huge_page()Kirill A. Shutemov1-3/+44
Currently pagewalker splits all THP pages on any clear_refs request. It's not necessary. We can handle this on PMD level. One side effect is that soft dirty will potentially see more dirty memory, since we will mark whole THP page dirty at once. Sanity checked with CRIU test suite. More testing is required. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Naoya Horiguchi <[email protected]> Reviewed-by: Cyrill Gorcunov <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Dave Hansen <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-02-11mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP)Naoya Horiguchi3-8/+19
walk_page_range() silently skips any vma having VM_PFNMAP set, which leads to undesirable behaviour at the client end (whoever called walk_page_range). For example, for pagemap_read(), when no callbacks are called against a VM_PFNMAP vma, pagemap_read() may prepare pagemap data for the next virtual address range at the wrong index. That could confuse and/or break userspace applications.

This patch avoids this misbehavior caused by vma(VM_PFNMAP) as follows:

  - for pagemap_read(), which has its own ->pte_hole(), call the ->pte_hole() over vma(VM_PFNMAP),
  - for clear_refs and queue_pages, which have their own ->test_walk, just return 1 and skip vma(VM_PFNMAP). This is no problem because these are not interested in hole regions,
  - for other callers, just skip the vma(VM_PFNMAP) as a default behavior.

Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Shiraz Hashim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2015-02-11mempolicy: apply page table walker on queue_pages_range()Naoya Horiguchi1-136/+92
queue_pages_range() does page table walking in its own way now, but there is some code duplicate. This patch applies page table walker to reduce lines of code. queue_pages_range() has to do some precheck to determine whether we really walk over the vma or just skip it. Now we have test_walk() callback in mm_walk for this purpose, so we can do this replacement cleanly. queue_pages_test_walk() depends on not only the current vma but also the previous one, so queue_pages->prev is introduced to remember it. Signed-off-by: Naoya Horiguchi <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-02-11arch/powerpc/mm/subpage-prot.c: use walk->vma and walk_page_vma()Naoya Horiguchi1-4/+2
We don't have to use mm_walk->private to pass vma to the callback function because of mm_walk->vma. And walk_page_vma() is useful if we walk over a single vma. Signed-off-by: Naoya Horiguchi <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-02-11memcg: cleanup preparation for page table walkNaoya Horiguchi1-33/+16
pagewalk.c can handle vma in itself, so we don't have to pass vma via walk->private. And both of mem_cgroup_count_precharge() and mem_cgroup_move_charge() do for each vma loop themselves, but now it's done in pagewalk.c, so let's clean up them. Signed-off-by: Naoya Horiguchi <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-02-11numa_maps: remove numa_maps->vmaNaoya Horiguchi1-16/+13
pagewalk.c can handle vma in itself, so we don't have to pass vma via walk->private. And show_numa_map() walks pages on vma basis, so using walk_page_vma() is preferable. Signed-off-by: Naoya Horiguchi <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-02-11numa_maps: fix typo in gather_hugetbl_statsNaoya Horiguchi1-3/+3
Just doing s/gather_hugetbl_stats/gather_hugetlb_stats/g, this makes code grep-friendly. Signed-off-by: Naoya Horiguchi <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-02-11pagemap: use walk->vma instead of calling find_vma()Naoya Horiguchi1-54/+14
Page table walker has the information of the current vma in mm_walk, so we don't have to call find_vma() in each pagemap_(pte|hugetlb)_range() call any longer. Currently pagemap_pte_range() does vma loop itself, so this patch reduces many lines of code. NULL-vma check is omitted because we assume that we never run these callbacks on any address outside vma. And even if it were broken, NULL pointer dereference would be detected, so we can get enough information for debugging. Signed-off-by: Naoya Horiguchi <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-02-11clear_refs: remove clear_refs_private->vma and introduce clear_refs_test_walk()Naoya Horiguchi1-24/+22
clear_refs_write() has some prechecks to determine if we really walk over a given vma. Now we have a test_walk() callback to filter vmas, so let's utilize it. Signed-off-by: Naoya Horiguchi <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-02-11smaps: remove mem_size_stats->vma and use walk_page_vma()Naoya Horiguchi1-8/+4
pagewalk.c can handle vma in itself, so we don't have to pass vma via walk->private. And show_smap() walks pages on vma basis, so using walk_page_vma() is preferable. Signed-off-by: Naoya Horiguchi <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-02-11pagewalk: add walk_page_vma()Naoya Horiguchi2-0/+19
Introduce walk_page_vma(), which is useful for the callers which want to walk over a given vma. It's used by later patches. Signed-off-by: Naoya Horiguchi <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
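A hedged usage sketch of the new helper (the callback and private data names are hypothetical):

    int err;
    struct mm_walk walk = {
        .pmd_entry = my_pmd_entry,     /* hypothetical callback */
        .mm        = vma->vm_mm,
        .private   = &my_data,
    };

    /* walk only the given vma, rather than an [addr, end) range */
    err = walk_page_vma(vma, &walk);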
2015-02-11pagewalk: improve vma handlingNaoya Horiguchi2-92/+129
The current implementation of the page table walker has a fundamental problem in vma handling, which started when we tried to handle vma(VM_HUGETLB). Because it's done in the pgd loop, considering vma boundaries makes the code complicated and bug-prone. From the user's viewpoint, some users check a vma-related condition to determine whether they really want to walk pages over that vma. To solve these problems, this patch moves the vma check outside the pgd loop and introduces a new callback ->test_walk().

Signed-off-by: Naoya Horiguchi <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Cyrill Gorcunov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
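A hedged sketch of a ->test_walk() callback, using the VM_PFNMAP skip from the earlier pagewalk fix as the example condition (the function name is hypothetical):

    static int my_test_walk(unsigned long start, unsigned long end,
                            struct mm_walk *walk)
    {
        struct vm_area_struct *vma = walk->vma;

        if (vma->vm_flags & VM_PFNMAP)
            return 1;    /* skip this vma but continue the walk */
        return 0;        /* walk this vma normally */
    }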
2015-02-11mm/pagewalk: remove pgd_entry() and pud_entry()Naoya Horiguchi2-13/+2
Currently no user of page table walker sets ->pgd_entry() or ->pud_entry(), so checking their existence in each loop is just wasting CPU cycle. So let's remove it to reduce overhead. Signed-off-by: Naoya Horiguchi <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-02-11proc/pagemap: walk page tables under pte lockKonstantin Khlebnikov1-5/+9
Lockless access to pte in pagemap_pte_range() might race with page migration and trigger BUG_ON(!PageLocked()) in migration_entry_to_page():

  CPU A (pagemap)                 CPU B (migration)
                                  lock_page()
                                  try_to_unmap(page, TTU_MIGRATION...)
                                  make_migration_entry()
                                  set_pte_at()
  <read *pte>
  pte_to_pagemap_entry()
                                  remove_migration_ptes()
                                  unlock_page()
  if(is_migration_entry())
    migration_entry_to_page()
      BUG_ON(!PageLocked(page))

Also lockless read might be non-atomic if pte is larger than wordsize. Other pte walkers (smaps, numa_maps, clear_refs) already lock ptes.

Fixes: 052fb0d635df ("proc: report file/anon bit in /proc/pid/pagemap")
Signed-off-by: Konstantin Khlebnikov <[email protected]>
Reported-by: Andrey Ryabinin <[email protected]>
Reviewed-by: Cyrill Gorcunov <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: <[email protected]> [3.5+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
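A hedged sketch of the locked walk pattern the fix moves to (the loop body is elided; the helpers around the loop are standard pte-walk idioms, not the literal patch):

    pte_t *pte;
    spinlock_t *ptl;

    /* take the pte lock so migration cannot swap the pte for a
     * migration entry while we are decoding it */
    pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
    for (; addr < end; pte++, addr += PAGE_SIZE) {
        /* inspect *pte and emit the pagemap entry here */
    }
    pte_unmap_unlock(pte - 1, ptl);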
2015-02-11mm: gup: kvm use get_user_pages_unlockedAndrea Arcangeli3-58/+5
Use the more generic get_user_pages_unlocked which has the additional benefit of passing FAULT_FLAG_ALLOW_RETRY at the very first page fault (which allows the first page fault in an unmapped area to be always able to block indefinitely by being allowed to release the mmap_sem). Signed-off-by: Andrea Arcangeli <[email protected]> Reviewed-by: Andres Lagar-Cavilla <[email protected]> Reviewed-by: Kirill A. Shutemov <[email protected]> Cc: Peter Feiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-02-11mm: gup: use get_user_pages_unlockedAndrea Arcangeli5-22/+10
This allows those get_user_pages calls to pass FAULT_FLAG_ALLOW_RETRY to the page fault in order to release the mmap_sem during the I/O. Signed-off-by: Andrea Arcangeli <[email protected]> Reviewed-by: Kirill A. Shutemov <[email protected]> Cc: Andres Lagar-Cavilla <[email protected]> Cc: Peter Feiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-02-11mm: gup: use get_user_pages_unlocked within get_user_pages_fastAndrea Arcangeli7-33/+16
This allows the get_user_pages_fast slow path to release the mmap_sem before blocking. Signed-off-by: Andrea Arcangeli <[email protected]> Reviewed-by: Kirill A. Shutemov <[email protected]> Cc: Andres Lagar-Cavilla <[email protected]> Cc: Peter Feiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-02-11mm: gup: add __get_user_pages_unlocked to customize gup_flagsAndrea Arcangeli3-15/+49
Some callers (like KVM) may want to set gup_flags like FOLL_HWPOISON to get a proper -EHWPOISON retval instead of -EFAULT, so as to take a more appropriate action if get_user_pages runs into a memory failure.

Signed-off-by: Andrea Arcangeli <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Cc: Andres Lagar-Cavilla <[email protected]>
Cc: Peter Feiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2015-02-11mm: gup: add get_user_pages_locked and get_user_pages_unlockedAndrea Arcangeli3-11/+196
FAULT_FOLL_ALLOW_RETRY allows the page fault to drop the mmap_sem for reading to reduce the mmap_sem contention (for writing), like while waiting for I/O completion. The problem is that right now practically no get_user_pages call uses FAULT_FOLL_ALLOW_RETRY, so we're not leveraging that nifty feature.

Andres fixed it for the KVM page fault. However get_user_pages_fast remains uncovered, and 99% of other get_user_pages aren't using it either (the only exception being FOLL_NOWAIT in KVM which is really nonblocking and in fact it doesn't even release the mmap_sem).

So this patchset extends the optimization Andres did in the KVM page fault to the whole kernel. It makes the most important places (including gup_fast) use FAULT_FOLL_ALLOW_RETRY to reduce the mmap_sem hold times during I/O.

The only few places that remain uncovered are drivers like v4l and other exceptions that tend to work on their own memory and not on random user memory (for example O_DIRECT, which uses gup_fast and is fully covered by this patch).

A follow-up patch should probably also add a printk_once warning to get_user_pages, which should go obsolete and be phased out eventually. The "vmas" parameter of get_user_pages makes it fundamentally incompatible with FAULT_FOLL_ALLOW_RETRY (the vmas array becomes meaningless the moment the mmap_sem is released).

While this is just an optimization, it becomes an absolute requirement for the userfaultfd feature http://lwn.net/Articles/615086/ . userfaultfd allows blocking the page fault, and in order to do so I need to drop the mmap_sem first. So this patch also ensures that, for all memory where userfaultfd could be registered by KVM, the very first fault (no matter if it is a regular page fault or a get_user_pages) always has FAULT_FOLL_ALLOW_RETRY set. Then userfaultfd blocks and is woken only when the pagetable is already mapped. The second fault attempt after the wakeup doesn't need FAULT_FOLL_ALLOW_RETRY, so it's ok to retry without it.

This patch (of 5):

We can leverage the VM_FAULT_RETRY functionality in the page fault paths better by using either get_user_pages_locked or get_user_pages_unlocked.

The former allows conversion of get_user_pages invocations that will have to pass a "&locked" parameter to know if the mmap_sem was dropped during the call. Example from:

    down_read(&mm->mmap_sem);
    do_something()
    get_user_pages(tsk, mm, ..., pages, NULL);
    up_read(&mm->mmap_sem);

to:

    int locked = 1;
    down_read(&mm->mmap_sem);
    do_something()
    get_user_pages_locked(tsk, mm, ..., pages, &locked);
    if (locked)
        up_read(&mm->mmap_sem);

The latter is suitable only as a drop-in replacement of the form:

    down_read(&mm->mmap_sem);
    get_user_pages(tsk, mm, ..., pages, NULL);
    up_read(&mm->mmap_sem);

into:

    get_user_pages_unlocked(tsk, mm, ..., pages);

Where tsk, mm, the intermediate "..." parameters and "pages" can be any value as before. Just the last parameter of get_user_pages (vmas) must be NULL for get_user_pages_locked|unlocked to be usable (the latter original form wouldn't have been safe anyway if vmas wasn't null; for the former we just make it explicit by dropping the parameter). If vmas is not NULL these two methods cannot be used.

Signed-off-by: Andrea Arcangeli <[email protected]>
Reviewed-by: Andres Lagar-Cavilla <[email protected]>
Reviewed-by: Peter Feiner <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>