aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2016-10-07net: macb: NULL out phydev after removing mdio busNathan Sullivan1-0/+1
To ensure the dev->phydev pointer is not used after becoming invalid in mdiobus_unregister, set it to NULL. This happens when removing the macb driver without first taking its interface down, since unregister_netdev will end up calling macb_close. Signed-off-by: Xander Huff <[email protected]> Signed-off-by: Nathan Sullivan <[email protected]> Signed-off-by: Brad Mouring <[email protected]> Reviewed-by: Moritz Fischer <[email protected]> Acked-by: Nicolas Ferre <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-10-07xen-netback: make sure that hashes are not send to unaware frontendsPaul Durrant1-11/+9
In the case when a frontend only negotiates a single queue with xen- netback it is possible for a skbuff with a s/w hash to result in a hash extra_info segment being sent to the frontend even when no hash algorithm has been configured. (The ndo_select_queue() entry point makes sure the hash is not set if no algorithm is configured, but this entry point is not called when there is only a single queue). This can result in a frontend that is unable to handle extra_info segments being given such a segment, causing it to crash. This patch fixes the problem by clearing the hash in ndo_start_xmit() instead, which is clearly guaranteed to be called irrespective of the number of queues. Signed-off-by: Paul Durrant <[email protected]> Cc: Wei Liu <[email protected]> Acked-by: Wei Liu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-10-07Fixing a bug in team driver due to incorrect 'unsigned int' to 'int' conversionAlex Sidorenko1-1/+1
Roundrobin runner of team driver uses 'unsigned int' variable to count the number of sent_packets. Later it is passed to a subroutine team_num_to_port_index(struct team *team, int num) as 'num' and when we reach MAXINT (2**31-1), 'num' becomes negative. This leads to using incorrect hash-bucket for port lookup and as a result, packets are dropped. The fix consists of changing 'int num' to 'unsigned int num'. Testing of a fixed kernel shows that there is no packet drop anymore. Signed-off-by: Alex Sidorenko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-10-07Merge branch 'parisc-4.9-1' of ↵Linus Torvalds21-207/+308
git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux Pull parisc updates from Helge Deller: "Changes include: - Fix boot of 32bit SMP kernel (initial kernel mapping was too small) - Added hardened usercopy checks - Drop bootmem and switch to memblock and NO_BOOTMEM implementation - Drop the BROKEN_RODATA config option (and thus remove the relevant code from the generic headers and files because parisc was the last architecture which used this config option) - Improve segfault reporting by printing human readable error strings - Various smaller changes, e.g. dwarf debug support for assembly code, update comments regarding copy_user_page_asm, switch to kmalloc_array()" * 'parisc-4.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux: parisc: Increase KERNEL_INITIAL_SIZE for 32-bit SMP kernels parisc: Drop bootmem and switch to memblock parisc: Add hardened usercopy feature parisc: Add cfi_startproc and cfi_endproc to assembly code parisc: Move hpmc stack into page aligned bss section parisc: Fix self-detected CPU stall warnings on Mako machines parisc: Report trap type as human readable string parisc: Update comment regarding implementation of copy_user_page_asm parisc: Use kmalloc_array() in add_system_map_addresses() parisc: Check return value of smp_boot_one_cpu() parisc: Drop BROKEN_RODATA config option
2016-10-07MAINTAINERS: add myself as a maintainer of xen-netbackPaul Durrant1-0/+1
Signed-off-by: Paul Durrant <[email protected]> Cc: Wei Liu <[email protected]> Acked-by: Wei Liu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-10-07Merge branch 'for-linus' of ↵Linus Torvalds2-2/+3
git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/linux-avr32 Pull avr32 update from Hans-Christian Noren Egtvedt. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/linux-avr32: avr32: migrate exception table users off module.h and onto extable.h
2016-10-07ipv6 addrconf: disallow rtr_solicits < -1Maciej Żenczykowski1-1/+3
This disallows setting /proc/sys/net/ipv6/conf/*/router_solicitations to values below -1. -1 continues to mean an unlimited number of retransmits. Note: this depends on 'ipv6 addrconf: remove addrconf_sysctl_hop_limit()' Signed-off-by: Maciej Żenczykowski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-10-07Merge tag 'powerpc-4.9-1' of ↵Linus Torvalds227-2904/+5444
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Highlights: - Major rework of Book3S 64-bit exception vectors (Nicholas Piggin) - Use gas sections for arranging exception vectors et. al. - Large set of TM cleanups and selftests (Cyril Bur) - Enable transactional memory (TM) lazily for userspace (Cyril Bur) - Support for XZ compression in the zImage wrapper (Oliver O'Halloran) - Add support for bpf constant blinding (Naveen N. Rao) - Beginnings of upstream support for PA Semi Nemo motherboards (Darren Stevens) Fixes: - Ensure .mem(init|exit).text are within _stext/_etext (Michael Ellerman) - xmon: Don't use ld on 32-bit (Michael Ellerman) - vdso64: Use double word compare on pointers (Anton Blanchard) - powerpc/nvram: Fix an incorrect partition merge (Pan Xinhui) - powerpc: Fix usage of _PAGE_RO in hugepage (Christophe Leroy) - powerpc/mm: Update FORCE_MAX_ZONEORDER range to allow hugetlb w/4K (Aneesh Kumar K.V) - Fix memory leak in queue_hotplug_event() error path (Andrew Donnellan) - Replay hypervisor maintenance interrupt first (Nicholas Piggin) Various performance optimisations (Anton Blanchard): - Align hot loops of memset() and backwards_memcpy() - During context switch, check before setting mm_cpumask - Remove static branch prediction in atomic{, 64}_add_unless - Only disable HAVE_EFFICIENT_UNALIGNED_ACCESS on POWER7 little endian - Set default CPU type to POWER8 for little endian builds Cleanups & features: - Sparse fixes/cleanups (Daniel Axtens) - Preserve CFAR value on SLB miss caused by access to bogus address (Paul Mackerras) - Radix MMU fixups for POWER9 (Aneesh Kumar K.V) - Support for setting used_(vsr|vr|spe) in sigreturn path (for CRIU) (Simon Guo) - Optimise syscall entry for virtual, relocatable case (Nicholas Piggin) - Optimise MSR handling in exception handling (Nicholas Piggin) - Support for kexec with Radix MMU (Benjamin Herrenschmidt) - powernv EEH fixes (Russell Currey) - Suprise PCI hotplug support for powernv (Gavin Shan) - Endian/sparse fixes for powernv PCI (Gavin Shan) - Defconfig updates (Anton Blanchard) - KVM: PPC: Book3S HV: Migrate pinned pages out of CMA (Balbir Singh) - cxl: Flush PSL cache before resetting the adapter (Frederic Barrat) - cxl: replace loop with for_each_child_of_node(), remove unneeded of_node_put() (Andrew Donnellan) - Fix HV facility unavailable to use correct handler (Nicholas Piggin) - Remove unnecessary syscall trampoline (Nicholas Piggin) - fadump: Fix build break when CONFIG_PROC_VMCORE=n (Michael Ellerman) - Quieten EEH message when no adapters are found (Anton Blanchard) - powernv: Add PHB register dump debugfs handle (Russell Currey) - Use kprobe blacklist for exception handlers & asm functions (Nicholas Piggin) - Document the syscall ABI (Nicholas Piggin) - MAINTAINERS: Update cxl maintainers (Michael Neuling) - powerpc: Remove all usages of NO_IRQ (Michael Ellerman) Minor cleanups: - Andrew Donnellan, Christophe Leroy, Colin Ian King, Cyril Bur, Frederic Barrat, Pan Xinhui, PrasannaKumar Muralidharan, Rui Teng, Simon Guo" * tag 'powerpc-4.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (156 commits) powerpc/bpf: Add support for bpf constant blinding powerpc/bpf: Implement support for tail calls powerpc/bpf: Introduce accessors for using the tmp local stack space powerpc/fadump: Fix build break when CONFIG_PROC_VMCORE=n powerpc: tm: Enable transactional memory (TM) lazily for userspace powerpc/tm: Add TM Unavailable Exception powerpc: Remove do_load_up_transact_{fpu,altivec} powerpc: tm: Rename transct_(*) to ck(\1)_state powerpc: tm: Always use fp_state and vr_state to store live registers selftests/powerpc: Add checks for transactional VSXs in signal contexts selftests/powerpc: Add checks for transactional VMXs in signal contexts selftests/powerpc: Add checks for transactional FPUs in signal contexts selftests/powerpc: Add checks for transactional GPRs in signal contexts selftests/powerpc: Check that signals always get delivered selftests/powerpc: Add TM tcheck helpers in C selftests/powerpc: Allow tests to extend their kill timeout selftests/powerpc: Introduce GPR asm helper header file selftests/powerpc: Move VMX stack frame macros to header file selftests/powerpc: Rework FPU stack placement macros and move to header file selftests/powerpc: Check for VSX preservation across userspace preemption ...
2016-10-07vfs: Remove {get,set,remove}xattr inode operationsAndreas Gruenbacher55-319/+0
These inode operations are no longer used; remove them. Signed-off-by: Andreas Gruenbacher <[email protected]> Signed-off-by: Al Viro <[email protected]>
2016-10-07console: don't prefer first registered if DT specifies stdout-pathPaul Burton3-1/+20
If a device tree specifies a preferred device for kernel console output via the stdout-path or linux,stdout-path chosen node properties or the stdout alias then the kernel ought to honor it & output the kernel console to that device. As it stands, this isn't the case. Whilst we parse the stdout-path properties & set an of_stdout variable from of_alias_scan(), and use that from of_console_check() to determine whether to add a console device as a preferred console whilst registering it, we also prefer the first registered console if no other has been selected at the time of its registration. This means that if a console other than the one the device tree selects via stdout-path is registered first, we will switch to using it & when the stdout-path console is later registered the call to add_preferred_console() via of_console_check() is too late to do anything useful. In practice this seems to mean that we switch to the dummy console device fairly early & see no further console output: Console: colour dummy device 80x25 console [tty0] enabled bootconsole [ns16550a0] disabled Fix this by not automatically preferring the first registered console if one is specified by the device tree. This allows consoles to be registered but not enabled, and once the driver for the console selected by stdout-path calls of_console_check() the driver will be added to the list of preferred consoles before any other console has been enabled. When that console is then registered via register_console() it will be enabled as expected. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Paul Burton <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Paul Burton <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Ivan Delalande <[email protected]> Cc: Thierry Reding <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Jan Kara <[email protected]> Cc: Petr Mladek <[email protected]> Cc: Joe Perches <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Rob Herring <[email protected]> Cc: Frank Rowand <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07cred: simpler, 1D supplementary groupsAlexey Dobriyan14-85/+46
Current supplementary groups code can massively overallocate memory and is implemented in a way so that access to individual gid is done via 2D array. If number of gids is <= 32, memory allocation is more or less tolerable (140/148 bytes). But if it is not, code allocates full page (!) regardless and, what's even more fun, doesn't reuse small 32-entry array. 2D array means dependent shifts, loads and LEAs without possibility to optimize them (gid is never known at compile time). All of the above is unnecessary. Switch to the usual trailing-zero-len-array scheme. Memory is allocated with kmalloc/vmalloc() and only as much as needed. Accesses become simpler (LEA 8(gi,idx,4) or even without displacement). Maximum number of gids is 65536 which translates to 256KB+8 bytes. I think kernel can handle such allocation. On my usual desktop system with whole 9 (nine) aux groups, struct group_info shrinks from 148 bytes to 44 bytes, yay! Nice side effects: - "gi->gid[i]" is shorter than "GROUP_AT(gi, i)", less typing, - fix little mess in net/ipv4/ping.c should have been using GROUP_AT macro but this point becomes moot, - aux group allocation is persistent and should be accounted as such. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Alexey Dobriyan <[email protected]> Cc: Vasily Kulikov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07CREDITS: update Pavel's information, add GPG key, remove snail mail addressPavel Machek1-4/+4
Link: http://lkml.kernel.org/r/20161003082312.GA20634@amd Signed-off-by: Pavel Machek <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07mailmap: add Johan HovoldJohan Hovold1-0/+2
Add two entries to map to my primary address. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Johan Hovold <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07.gitattributes: set git diff driver for C source code filesJean Delvare1-0/+2
Git can be told to apply language-specific rules when generating diffs. Enable this for C source code files (*.c and *.h) so that function names are printed right. Specifically, doing so prevents "git diff" from mistakenly considering unindented goto labels as function names. Link: http://lkml.kernel.org/r/20160907143403.1449324f@endymion Signed-off-by: Jean Delvare <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Joe Perches <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07uprobes: remove function declarations from arch/{mips,s390}Marcin Nowakowski2-22/+0
The declarations of arch-specific functions have been moved to a common header in commit 3820b4d2789f ('uprobes: Move function declarations out of arch'), but MIPS and S390 has added them to their own trees later. Remove the unnecessary duplicates. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Marcin Nowakowski <[email protected]> Acked-by: Heiko Carstens <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Ralf Baechle <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07spelling.txt: "modeled" is spelt correctlyJoe Perches1-1/+0
No need to correct the correct. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Joe Perches <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07nmi_backtrace: generate one-line reports for idle cpusChris Metcalf49-22/+93
When doing an nmi backtrace of many cores, most of which are idle, the output is a little overwhelming and very uninformative. Suppress messages for cpus that are idling when they are interrupted and just emit one line, "NMI backtrace for N skipped: idling at pc 0xNNN". We do this by grouping all the cpuidle code together into a new .cpuidle.text section, and then checking the address of the interrupted PC to see if it lies within that section. This commit suitably tags x86 and tile idle routines, and only adds in the minimal framework for other architectures. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Chris Metcalf <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Daniel Thompson <[email protected]> [arm] Tested-by: Petr Mladek <[email protected]> Cc: Aaron Tomlin <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Russell King <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07arch/tile: adopt the new nmi_backtrace frameworkChris Metcalf4-63/+27
Previously tile was rolling its own method of capturing backtrace data in the NMI handlers, but it was relying on running printk() from the NMI handler, which is not always safe. So adopt the nmi_backtrace model (with the new cpumask extension) instead. So we can call the nmi_backtrace code directly from the nmi handler, move the nmi_enter()/exit() into the top-level tile NMI handler. The semantics of the routine change slightly since it is now synchronous with the remote cores completing the backtraces. Previously it was asynchronous, but with protection to avoid starting a new remote backtrace if the old one was still in progress. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Chris Metcalf <[email protected]> Cc: Daniel Thompson <[email protected]> [arm] Cc: Petr Mladek <[email protected]> Cc: Aaron Tomlin <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Russell King <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07nmi_backtrace: do a local dump_stack() instead of a self-NMIChris Metcalf2-9/+9
Currently on arm there is code that checks whether it should call dump_stack() explicitly, to avoid trying to raise an NMI when the current context is not preemptible by the backtrace IPI. Similarly, the forthcoming arch/tile support uses an IPI mechanism that does not support generating an NMI to self. Accordingly, move the code that guards this case into the generic mechanism, and invoke it unconditionally whenever we want a backtrace of the current cpu. It seems plausible that in all cases, dump_stack() will generate better information than generating a stack from the NMI handler. The register state will be missing, but that state is likely not particularly helpful in any case. Or, if we think it is helpful, we should be capturing and emitting the current register state in all cases when regs == NULL is passed to nmi_cpu_backtrace(). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Chris Metcalf <[email protected]> Tested-by: Daniel Thompson <[email protected]> [arm] Reviewed-by: Petr Mladek <[email protected]> Acked-by: Aaron Tomlin <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Russell King <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07nmi_backtrace: add more trigger_*_cpu_backtrace() methodsChris Metcalf10-39/+72
Patch series "improvements to the nmi_backtrace code" v9. This patch series modifies the trigger_xxx_backtrace() NMI-based remote backtracing code to make it more flexible, and makes a few small improvements along the way. The motivation comes from the task isolation code, where there are scenarios where we want to be able to diagnose a case where some cpu is about to interrupt a task-isolated cpu. It can be helpful to see both where the interrupting cpu is, and also an approximation of where the cpu that is being interrupted is. The nmi_backtrace framework allows us to discover the stack of the interrupted cpu. I've tested that the change works as desired on tile, and build-tested x86, arm, mips, and sparc64. For x86 I confirmed that the generic cpuidle stuff as well as the architecture-specific routines are in the new cpuidle section. For arm, mips, and sparc I just build-tested it and made sure the generic cpuidle routines were in the new cpuidle section, but I didn't attempt to figure out which the platform-specific idle routines might be. That might be more usefully done by someone with platform experience in follow-up patches. This patch (of 4): Currently you can only request a backtrace of either all cpus, or all cpus but yourself. It can also be helpful to request a remote backtrace of a single cpu, and since we want that, the logical extension is to support a cpumask as the underlying primitive. This change modifies the existing lib/nmi_backtrace.c code to take a cpumask as its basic primitive, and modifies the linux/nmi.h code to use the new "cpumask" method instead. The existing clients of nmi_backtrace (arm and x86) are converted to using the new cpumask approach in this change. The other users of the backtracing API (sparc64 and mips) are converted to use the cpumask approach rather than the all/allbutself approach. The mips code ignored the "include_self" boolean but with this change it will now also dump a local backtrace if requested. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Chris Metcalf <[email protected]> Tested-by: Daniel Thompson <[email protected]> [arm] Reviewed-by: Aaron Tomlin <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Russell King <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: David Miller <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07min/max: remove sparse warnings when they're nestedJohannes Berg1-20/+28
Currently, when min/max are nested within themselves, sparse will warn: warning: symbol '_min1' shadows an earlier one originally declared here warning: symbol '_min1' shadows an earlier one originally declared here warning: symbol '_min2' shadows an earlier one originally declared here This also immediately happens when min3() or max3() are used. Since sparse implements __COUNTER__, we can use __UNIQUE_ID() to generate unique variable names, avoiding this. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Berg <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07Documentation/filesystems/proc.txt: add more description for maps/smapsRobert Ho1-0/+12
Add some more description on the limitations for smaps/maps readings, as well as some guaruntees we can make. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Robert Ho <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Xiao Guangrong <[email protected]> Cc: Robert Hu <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Dan Williams <[email protected]> Cc: Gleb Natapov <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Stefan Hajnoczi <[email protected]> Cc: Ross Zwisler <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07mm, proc: fix region lost in /proc/self/smapsRobert Ho1-3/+5
Recently, Redhat reported that nvml test suite failed on QEMU/KVM, more detailed info please refer to: https://bugzilla.redhat.com/show_bug.cgi?id=1365721 Actually, this bug is not only for NVDIMM/DAX but also for any other file systems. This simple test case abstracted from nvml can easily reproduce this bug in common environment: -------------------------- testcase.c ----------------------------- int is_pmem_proc(const void *addr, size_t len) { const char *caddr = addr; FILE *fp; if ((fp = fopen("/proc/self/smaps", "r")) == NULL) { printf("!/proc/self/smaps"); return 0; } int retval = 0; /* assume false until proven otherwise */ char line[PROCMAXLEN]; /* for fgets() */ char *lo = NULL; /* beginning of current range in smaps file */ char *hi = NULL; /* end of current range in smaps file */ int needmm = 0; /* looking for mm flag for current range */ while (fgets(line, PROCMAXLEN, fp) != NULL) { static const char vmflags[] = "VmFlags:"; static const char mm[] = " wr"; /* check for range line */ if (sscanf(line, "%p-%p", &lo, &hi) == 2) { if (needmm) { /* last range matched, but no mm flag found */ printf("never found mm flag.\n"); break; } else if (caddr < lo) { /* never found the range for caddr */ printf("#######no match for addr %p.\n", caddr); break; } else if (caddr < hi) { /* start address is in this range */ size_t rangelen = (size_t)(hi - caddr); /* remember that matching has started */ needmm = 1; /* calculate remaining range to search for */ if (len > rangelen) { len -= rangelen; caddr += rangelen; printf("matched %zu bytes in range " "%p-%p, %zu left over.\n", rangelen, lo, hi, len); } else { len = 0; printf("matched all bytes in range " "%p-%p.\n", lo, hi); } } } else if (needmm && strncmp(line, vmflags, sizeof(vmflags) - 1) == 0) { if (strstr(&line[sizeof(vmflags) - 1], mm) != NULL) { printf("mm flag found.\n"); if (len == 0) { /* entire range matched */ retval = 1; break; } needmm = 0; /* saw what was needed */ } else { /* mm flag not set for some or all of range */ printf("range has no mm flag.\n"); break; } } } fclose(fp); printf("returning %d.\n", retval); return retval; } void *Addr; size_t Size; /* * worker -- the work each thread performs */ static void * worker(void *arg) { int *ret = (int *)arg; *ret = is_pmem_proc(Addr, Size); return NULL; } int main(int argc, char *argv[]) { if (argc < 2 || argc > 3) { printf("usage: %s file [env].\n", argv[0]); return -1; } int fd = open(argv[1], O_RDWR); struct stat stbuf; fstat(fd, &stbuf); Size = stbuf.st_size; Addr = mmap(0, stbuf.st_size, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0); close(fd); pthread_t threads[NTHREAD]; int ret[NTHREAD]; /* kick off NTHREAD threads */ for (int i = 0; i < NTHREAD; i++) pthread_create(&threads[i], NULL, worker, &ret[i]); /* wait for all the threads to complete */ for (int i = 0; i < NTHREAD; i++) pthread_join(threads[i], NULL); /* verify that all the threads return the same value */ for (int i = 1; i < NTHREAD; i++) { if (ret[0] != ret[i]) { printf("Error i %d ret[0] = %d ret[i] = %d.\n", i, ret[0], ret[i]); } } printf("%d", ret[0]); return 0; } It failed as some threads can not find the memory region in "/proc/self/smaps" which is allocated in the main process It is caused by proc fs which uses 'file->version' to indicate the VMA that is the last one has already been handled by read() system call. When the next read() issues, it uses the 'version' to find the VMA, then the next VMA is what we want to handle, the related code is as follows: if (last_addr) { vma = find_vma(mm, last_addr); if (vma && (vma = m_next_vma(priv, vma))) return vma; } However, VMA will be lost if the last VMA is gone, e.g: The process VMA list is A->B->C->D CPU 0 CPU 1 read() system call handle VMA B version = B return to userspace unmap VMA B issue read() again to continue to get the region info find_vma(version) will get VMA C m_next_vma(C) will get VMA D handle D !!! VMA C is lost !!! In order to fix this bug, we make 'file->version' indicate the end address of the current VMA. m_start will then look up a vma which with vma_start < last_vm_end and moves on to the next vma if we found the same or an overlapping vma. This will guarantee that we will not miss an exclusive vma but we can still miss one if the previous vma was shrunk. This is acceptable because guaranteeing "never miss a vma" is simply not feasible. User has to cope with some inconsistencies if the file is not read in one go. [[email protected]: changelog fixes] Link: http://lkml.kernel.org/r/[email protected] Acked-by: Dave Hansen <[email protected]> Signed-off-by: Xiao Guangrong <[email protected]> Signed-off-by: Robert Hu <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Dan Williams <[email protected]> Cc: Gleb Natapov <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Stefan Hajnoczi <[email protected]> Cc: Ross Zwisler <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07proc: fix timerslack_ns CAP_SYS_NICE check when adjusting selfJohn Stultz1-15/+19
In changing from checking ptrace_may_access(p, PTRACE_MODE_ATTACH_FSCREDS) to capable(CAP_SYS_NICE), I missed that ptrace_my_access succeeds when p == current, but the CAP_SYS_NICE doesn't. Thus while the previous commit was intended to loosen the needed privileges to modify a processes timerslack, it needlessly restricted a task modifying its own timerslack via the proc/<tid>/timerslack_ns (which is permitted also via the PR_SET_TIMERSLACK method). This patch corrects this by checking if p == current before checking the CAP_SYS_NICE value. This patch applies on top of my two previous patches currently in -mm Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: John Stultz <[email protected]> Acked-by: Kees Cook <[email protected]> Cc: "Serge E. Hallyn" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Oren Laadan <[email protected]> Cc: Ruchi Kandoi <[email protected]> Cc: Rom Lemarchand <[email protected]> Cc: Todd Kjos <[email protected]> Cc: Colin Cross <[email protected]> Cc: Nick Kralevich <[email protected]> Cc: Dmitry Shmidt <[email protected]> Cc: Elliott Hughes <[email protected]> Cc: Android Kernel Team <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07proc: add LSM hook checks to /proc/<tid>/timerslack_nsJohn Stultz1-0/+10
As requested, this patch checks the existing LSM hooks task_getscheduler/task_setscheduler when reading or modifying the task's timerslack value. Previous versions added new get/settimerslack LSM hooks, but since they checked the same PROCESS__SET/GETSCHED values as existing hooks, it was suggested we just use the existing ones. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: John Stultz <[email protected]> Cc: Kees Cook <[email protected]> Cc: "Serge E. Hallyn" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Oren Laadan <[email protected]> Cc: Ruchi Kandoi <[email protected]> Cc: Rom Lemarchand <[email protected]> Cc: Todd Kjos <[email protected]> Cc: Colin Cross <[email protected]> Cc: Nick Kralevich <[email protected]> Cc: Dmitry Shmidt <[email protected]> Cc: Elliott Hughes <[email protected]> Cc: James Morris <[email protected]> Cc: Android Kernel Team <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07proc: relax /proc/<tid>/timerslack_ns capability requirementsJohn Stultz1-14/+20
When an interface to allow a task to change another tasks timerslack was first proposed, it was suggested that something greater then CAP_SYS_NICE would be needed, as a task could be delayed further then what normally could be done with nice adjustments. So CAP_SYS_PTRACE was adopted instead for what became the /proc/<tid>/timerslack_ns interface. However, for Android (where this feature originates), giving the system_server CAP_SYS_PTRACE would allow it to observe and modify all tasks memory. This is considered too high a privilege level for only needing to change the timerslack. After some discussion, it was realized that a CAP_SYS_NICE process can set a task as SCHED_FIFO, so they could fork some spinning processes and set them all SCHED_FIFO 99, in effect delaying all other tasks for an infinite amount of time. So as a CAP_SYS_NICE task can already cause trouble for other tasks, using it as a required capability for accessing and modifying /proc/<tid>/timerslack_ns seems sufficient. Thus, this patch loosens the capability requirements to CAP_SYS_NICE and removes CAP_SYS_PTRACE, simplifying some of the code flow as well. This is technically an ABI change, but as the feature just landed in 4.6, I suspect no one is yet using it. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: John Stultz <[email protected]> Reviewed-by: Nick Kralevich <[email protected]> Acked-by: Serge Hallyn <[email protected]> Acked-by: Kees Cook <[email protected]> Cc: Kees Cook <[email protected]> Cc: "Serge E. Hallyn" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Oren Laadan <[email protected]> Cc: Ruchi Kandoi <[email protected]> Cc: Rom Lemarchand <[email protected]> Cc: Todd Kjos <[email protected]> Cc: Colin Cross <[email protected]> Cc: Nick Kralevich <[email protected]> Cc: Dmitry Shmidt <[email protected]> Cc: Elliott Hughes <[email protected]> Cc: Android Kernel Team <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07meminfo: break apart a very long seq_printf with #ifdefsJoe Perches1-116/+95
Use a specific routine to emit most lines so that the code is easier to read and maintain. akpm: text data bss dec hex filename 2976 8 0 2984 ba8 fs/proc/meminfo.o before 2669 8 0 2677 a75 fs/proc/meminfo.o after Link: http://lkml.kernel.org/r/8fce7fdef2ba081a4ef531594e97da8a9feebb58.1470810406.git.joe@perches.com Signed-off-by: Joe Perches <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not charJoe Perches5-136/+154
Allow some seq_puts removals by taking a string instead of a single char. [[email protected]: update vmstat_show(), per Joe] Link: http://lkml.kernel.org/r/667e1cf3d436de91a5698170a1e98d882905e956.1470704995.git.joe@perches.com Signed-off-by: Joe Perches <[email protected]> Cc: Joe Perches <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07proc: faster /proc/*/statusAlexey Dobriyan1-40/+47
top(1) opens the following files for every PID: /proc/*/stat /proc/*/statm /proc/*/status This patch switches /proc/*/status away from seq_printf(). The result is 13.5% speedup. Benchmark is open("/proc/self/status")+read+close 1.000.000 million times. BEFORE $ perf stat -r 10 taskset -c 3 ./proc-self-status Performance counter stats for 'taskset -c 3 ./proc-self-status' (10 runs): 10748.474301 task-clock (msec) # 0.954 CPUs utilized ( +- 0.91% ) 12 context-switches # 0.001 K/sec ( +- 1.09% ) 1 cpu-migrations # 0.000 K/sec 104 page-faults # 0.010 K/sec ( +- 0.45% ) 37,424,127,876 cycles # 3.482 GHz ( +- 0.04% ) 8,453,010,029 stalled-cycles-frontend # 22.59% frontend cycles idle ( +- 0.12% ) 3,747,609,427 stalled-cycles-backend # 10.01% backend cycles idle ( +- 0.68% ) 65,632,764,147 instructions # 1.75 insn per cycle # 0.13 stalled cycles per insn ( +- 0.00% ) 13,981,324,775 branches # 1300.773 M/sec ( +- 0.00% ) 138,967,110 branch-misses # 0.99% of all branches ( +- 0.18% ) 11.263885428 seconds time elapsed ( +- 0.04% ) ^^^^^^^^^^^^ AFTER $ perf stat -r 10 taskset -c 3 ./proc-self-status Performance counter stats for 'taskset -c 3 ./proc-self-status' (10 runs): 9010.521776 task-clock (msec) # 0.925 CPUs utilized ( +- 1.54% ) 11 context-switches # 0.001 K/sec ( +- 1.54% ) 1 cpu-migrations # 0.000 K/sec ( +- 11.11% ) 103 page-faults # 0.011 K/sec ( +- 0.60% ) 32,352,310,603 cycles # 3.591 GHz ( +- 0.07% ) 7,849,199,578 stalled-cycles-frontend # 24.26% frontend cycles idle ( +- 0.27% ) 3,269,738,842 stalled-cycles-backend # 10.11% backend cycles idle ( +- 0.73% ) 56,012,163,567 instructions # 1.73 insn per cycle # 0.14 stalled cycles per insn ( +- 0.00% ) 11,735,778,795 branches # 1302.453 M/sec ( +- 0.00% ) 98,084,459 branch-misses # 0.84% of all branches ( +- 0.28% ) 9.741247736 seconds time elapsed ( +- 0.07% ) ^^^^^^^^^^^ Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Alexey Dobriyan <[email protected]> Cc: Joe Perches <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07proc: much faster /proc/vmstatAlexey Dobriyan1-1/+4
Every current KDE system has process named ksysguardd polling files below once in several seconds: $ strace -e trace=open -p $(pidof ksysguardd) Process 1812 attached open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 8 open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 8 open("/proc/net/dev", O_RDONLY) = 8 open("/proc/net/wireless", O_RDONLY) = -1 ENOENT (No such file or directory) open("/proc/stat", O_RDONLY) = 8 open("/proc/vmstat", O_RDONLY) = 8 Hell knows what it is doing but speed up reading /proc/vmstat by 33%! Benchmark is open+read+close 1.000.000 times. BEFORE $ perf stat -r 10 taskset -c 3 ./proc-vmstat Performance counter stats for 'taskset -c 3 ./proc-vmstat' (10 runs): 13146.768464 task-clock (msec) # 0.960 CPUs utilized ( +- 0.60% ) 15 context-switches # 0.001 K/sec ( +- 1.41% ) 1 cpu-migrations # 0.000 K/sec ( +- 11.11% ) 104 page-faults # 0.008 K/sec ( +- 0.57% ) 45,489,799,349 cycles # 3.460 GHz ( +- 0.03% ) 9,970,175,743 stalled-cycles-frontend # 21.92% frontend cycles idle ( +- 0.10% ) 2,800,298,015 stalled-cycles-backend # 6.16% backend cycles idle ( +- 0.32% ) 79,241,190,850 instructions # 1.74 insn per cycle # 0.13 stalled cycles per insn ( +- 0.00% ) 17,616,096,146 branches # 1339.956 M/sec ( +- 0.00% ) 176,106,232 branch-misses # 1.00% of all branches ( +- 0.18% ) 13.691078109 seconds time elapsed ( +- 0.03% ) ^^^^^^^^^^^^ AFTER $ perf stat -r 10 taskset -c 3 ./proc-vmstat Performance counter stats for 'taskset -c 3 ./proc-vmstat' (10 runs): 8688.353749 task-clock (msec) # 0.950 CPUs utilized ( +- 1.25% ) 10 context-switches # 0.001 K/sec ( +- 2.13% ) 1 cpu-migrations # 0.000 K/sec 104 page-faults # 0.012 K/sec ( +- 0.56% ) 30,384,010,730 cycles # 3.497 GHz ( +- 0.07% ) 12,296,259,407 stalled-cycles-frontend # 40.47% frontend cycles idle ( +- 0.13% ) 3,370,668,651 stalled-cycles-backend # 11.09% backend cycles idle ( +- 0.69% ) 28,969,052,879 instructions # 0.95 insn per cycle # 0.42 stalled cycles per insn ( +- 0.01% ) 6,308,245,891 branches # 726.058 M/sec ( +- 0.00% ) 214,685,502 branch-misses # 3.40% of all branches ( +- 0.26% ) 9.146081052 seconds time elapsed ( +- 0.07% ) ^^^^^^^^^^^ vsnprintf() is slow because: 1. format_decode() is busy looking for format specifier: 2 branches per character (not in this case, but in others) 2. approximately million branches while parsing format mini language and everywhere 3. just look at what string() does /proc/vmstat is good case because most of its content are strings Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Alexey Dobriyan <[email protected]> Cc: Joe Perches <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07atomic64: no need for CONFIG_ARCH_HAS_ATOMIC64_DEC_IF_POSITIVEVineet Gupta12-17/+0
This came to light when implementing native 64-bit atomics for ARCv2. The atomic64 self-test code uses CONFIG_ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE to check whether atomic64_dec_if_positive() is available. It seems it was needed when not every arch defined it. However as of current code the Kconfig option seems needless - for CONFIG_GENERIC_ATOMIC64 it is auto-enabled in lib/Kconfig and a generic definition of API is present lib/atomic64.c - arches with native 64-bit atomics select it in arch/*/Kconfig and define the API in their headers So I see no point in keeping the Kconfig option Compile tested for: - blackfin (CONFIG_GENERIC_ATOMIC64) - x86 (!CONFIG_GENERIC_ATOMIC64) - ia64 Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vineet Gupta <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: Matt Turner <[email protected]> Cc: Russell King <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Helge Deller <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Zhaoxiu Zeng <[email protected]> Cc: Linus Walleij <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Herbert Xu <[email protected]> Cc: Ming Lin <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Boqun Feng <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07ia64: implement atomic64_dec_if_positiveVineet Gupta1-0/+16
This is based on s390 version and needed to get rid of CONFIG_ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vineet Gupta <[email protected]> Reported-by: kbuild test robot <[email protected]> Cc: Tony Luck <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07linux/mm.h: canonicalize macro PAGE_ALIGNED() definitionzijun_hu1-1/+1
The macro PAGE_ALIGNED() is prone to cause error because it doesn't follow convention to parenthesize parameter @addr within macro body, for example unsigned long *ptr = kmalloc(...); PAGE_ALIGNED(ptr + 16); for the left parameter of macro IS_ALIGNED(), (unsigned long)(ptr + 16) is desired but the actual one is (unsigned long)ptr + 16. It is fixed by simply canonicalizing macro PAGE_ALIGNED() definition. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: zijun_hu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07mm: remove unnecessary condition in remove_inode_hugepageszhong jiang3-10/+8
When the huge page is added to the page cahce (huge_add_to_page_cache), the page private flag will be cleared. since this code (remove_inode_hugepages) will only be called for pages in the page cahce, PagePrivate(page) will always be false. The patch remove the code without any functional change. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: zhong jiang <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Tested-by: Mike Kravetz <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07mm: warn about allocations which stall for too longMichal Hocko1-0/+10
Currently we do warn only about allocation failures but small allocations are basically nofail and they might loop in the page allocator for a long time. Especially when the reclaim cannot make any progress - e.g. GFP_NOFS cannot invoke the oom killer and rely on a different context to make a forward progress in case there is a lot memory used by filesystems. Give us at least a clue when something like this happens and warn about allocations which take more than 10s. Print the basic allocation context information along with the cumulative time spent in the allocation as well as the allocation stack. Repeat the warning after every 10 seconds so that we know that the problem is permanent rather than ephemeral. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Dave Hansen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07mm: consolidate warn_alloc_failed usersMichal Hocko3-26/+20
warn_alloc_failed is currently used from the page and vmalloc allocators. This is a good reuse of the code except that vmalloc would appreciate a slightly different warning message. This is already handled by the fmt parameter except that "%s: page allocation failure: order:%u, mode:%#x(%pGg)" is printed anyway. This might be quite misleading because it might be a vmalloc failure which leads to the warning while the page allocator is not the culprit here. Fix this by always using the fmt string and only print the context that makes sense for the particular context (e.g. order makes only very little sense for the vmalloc context). Rename the function to not miss any user and also because a later patch will reuse it also for !failure cases. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Dave Hansen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07vfs,mm: fix a dead loop in truncate_inode_pages_range()Wei Fang1-0/+4
We triggered a deadloop in truncate_inode_pages_range() on 32 bits architecture with the test case bellow: ... fd = open(); write(fd, buf, 4096); preadv64(fd, &iovec, 1, 0xffffffff000); ftruncate(fd, 0); ... Then ftruncate() will not return forever. The filesystem used in this case is ubifs, but it can be triggered on many other filesystems. When preadv64() is called with offset=0xffffffff000, a page with index=0xffffffff will be added to the radix tree of ->mapping. Then this page can be found in ->mapping with pagevec_lookup(). After that, truncate_inode_pages_range(), which is called in ftruncate(), will fall into an infinite loop: - find a page with index=0xffffffff, since index>=end, this page won't be truncated - index++, and index become 0 - the page with index=0xffffffff will be found again The data type of index is unsigned long, so index won't overflow to 0 on 64 bits architecture in this case, and the dead loop won't happen. Since truncate_inode_pages_range() is executed with holding lock of inode->i_rwsem, any operation related with this lock will be blocked, and a hung task will happen, e.g.: INFO: task truncate_test:3364 blocked for more than 120 seconds. ... call_rwsem_down_write_failed+0x17/0x30 generic_file_write_iter+0x32/0x1c0 ubifs_write_iter+0xcc/0x170 __vfs_write+0xc4/0x120 vfs_write+0xb2/0x1b0 SyS_write+0x46/0xa0 The page with index=0xffffffff added to ->mapping is useless. Fix this by checking the read position before allocating pages. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Wei Fang <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Al Viro <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07arm64 Kconfig: select gigantic pageYisheng Xie1-0/+1
Arm64 supports gigantic pages after commit 084bd29810a5 ("ARM64: mm: HugeTLB support.") however, it can only be allocated at boottime and can't be freed. This patch selects ARCH_HAS_GIGANTIC_PAGE to make gigantic pages can be allocated and freed at runtime for arch arm64. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yisheng Xie <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Catalin Marinas <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: Hanjun Guo <[email protected]> Cc: Will Deacon <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Sudeep Holla <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Rob Herring <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07mm/hugetlb: introduce ARCH_HAS_GIGANTIC_PAGEYisheng Xie4-1/+6
Avoid making ifdef get pretty unwieldy if many ARCHs support gigantic page. No functional change with this patch. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yisheng Xie <[email protected]> Suggested-by: Michal Hocko <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: Hanjun Guo <[email protected]> Cc: Will Deacon <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Sudeep Holla <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Rob Herring <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07oom: print nodemask in the oom reportMichal Hocko1-2/+5
We have received a hard to explain oom report from a customer. The oom triggered regardless there is a lot of free memory: PoolThread invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0 PoolThread cpuset=/ mems_allowed=0-7 Pid: 30055, comm: PoolThread Tainted: G E X 3.0.101-80-default #1 Call Trace: dump_trace+0x75/0x300 dump_stack+0x69/0x6f dump_header+0x8e/0x110 oom_kill_process+0xa6/0x350 out_of_memory+0x2b7/0x310 __alloc_pages_slowpath+0x7dd/0x820 __alloc_pages_nodemask+0x1e9/0x200 alloc_pages_vma+0xe1/0x290 do_anonymous_page+0x13e/0x300 do_page_fault+0x1fd/0x4c0 page_fault+0x25/0x30 [...] active_anon:1135959151 inactive_anon:1051962 isolated_anon:0 active_file:13093 inactive_file:222506 isolated_file:0 unevictable:262144 dirty:2 writeback:0 unstable:0 free:432672819 slab_reclaimable:7917 slab_unreclaimable:95308 mapped:261139 shmem:166297 pagetables:2228282 bounce:0 [...] Node 0 DMA free:15896kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15672kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 2892 775542 775542 Node 0 DMA32 free:2783784kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2961572kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 0 772650 772650 Node 0 Normal free:8120kB min:8160kB low:10200kB high:12240kB active_anon:779334960kB inactive_anon:2198744kB active_file:0kB inactive_file:180kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:791193600kB mlocked:131072kB dirty:0kB writeback:0kB mapped:372940kB shmem:361480kB slab_reclaimable:4536kB slab_unreclaimable:68472kB kernel_stack:10104kB pagetables:1414820kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2280 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 1 Normal free:476718144kB min:8192kB low:10240kB high:12288kB active_anon:307623696kB inactive_anon:283620kB active_file:10392kB inactive_file:69908kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:257208kB shmem:189896kB slab_reclaimable:3868kB slab_unreclaimable:44756kB kernel_stack:1848kB pagetables:1369432kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 2 Normal free:386002452kB min:8192kB low:10240kB high:12288kB active_anon:398563752kB inactive_anon:68184kB active_file:10292kB inactive_file:29936kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:32084kB shmem:776kB slab_reclaimable:6888kB slab_unreclaimable:60056kB kernel_stack:8208kB pagetables:1282880kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 3 Normal free:196406760kB min:8192kB low:10240kB high:12288kB active_anon:587445640kB inactive_anon:164396kB active_file:5716kB inactive_file:709844kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:291776kB shmem:111416kB slab_reclaimable:5152kB slab_unreclaimable:44516kB kernel_stack:2168kB pagetables:1455956kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 4 Normal free:425338880kB min:8192kB low:10240kB high:12288kB active_anon:359695204kB inactive_anon:43216kB active_file:5748kB inactive_file:14772kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:24708kB shmem:1120kB slab_reclaimable:1884kB slab_unreclaimable:41060kB kernel_stack:1856kB pagetables:1100208kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 5 Normal free:11140kB min:8192kB low:10240kB high:12288kB active_anon:784240872kB inactive_anon:1217164kB active_file:28kB inactive_file:48kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:11408kB shmem:0kB slab_reclaimable:2008kB slab_unreclaimable:49220kB kernel_stack:1360kB pagetables:531600kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1202 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 6 Normal free:243395332kB min:8192kB low:10240kB high:12288kB active_anon:542015544kB inactive_anon:40208kB active_file:968kB inactive_file:8484kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:19992kB shmem:496kB slab_reclaimable:1672kB slab_unreclaimable:37052kB kernel_stack:2088kB pagetables:750264kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 7 Normal free:10768kB min:8192kB low:10240kB high:12288kB active_anon:784916936kB inactive_anon:192316kB active_file:19228kB inactive_file:56852kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:34440kB shmem:4kB slab_reclaimable:5660kB slab_unreclaimable:36100kB kernel_stack:1328kB pagetables:1007968kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 So all nodes but Node 0 have a lot of free memory which should suggest that there is an available memory especially when mems_allowed=0-7. One could speculate that a massive process has managed to terminate and free up a lot of memory while racing with the above allocation request. Although this is highly unlikely it cannot be ruled out. A further debugging, however shown that the faulting process had mempolicy (not cpuset) to bind to Node 0. We cannot see that information from the report though. mems_allowed turned out to be more confusing than really helpful. Fix this by always priting the nodemask. It is either mempolicy mask (and non-null) or the one defined by the cpusets. The new output for the above oom report would be PoolThread invoked oom-killer: gfp_mask=0x280da(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_adj=0, oom_score_adj=0 This patch doesn't touch show_mem and the node filtering based on the cpuset node mask because mempolicy is always a subset of cpusets and seeing the full cpuset oom context might be helpful for tunning more specific mempolicies inside cpusets (e.g. when they turn out to be too restrictive). To prevent from ugly ifdefs the mask is printed even for !NUMA configurations but this should be OK (a single node will be printed). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Reported-by: Sellami Abdelkader <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: David Rientjes <[email protected]> Cc: Sellami Abdelkader <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07mm: clarify why we avoid page_mapcount() for slab pages in dump_page()Kirill A. Shutemov1-0/+5
Let's add comment on why we skip page_mapcount() for sl[aou]b pages. Link: http://lkml.kernel.org/r/20160922105532.GB24593@node Signed-off-by: Kirill A. Shutemov <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07mm: vma_merge: correct false positive from __vma_unlink->validate_mm_rbAndrea Arcangeli1-18/+41
The old code was always doing: vma->vm_end = next->vm_end vma_rb_erase(next) // in __vma_unlink vma->vm_next = next->vm_next // in __vma_unlink next = vma->vm_next vma_gap_update(next) The new code still does the above for remove_next == 1 and 2, but for remove_next == 3 it has been changed and it does: next->vm_start = vma->vm_start vma_rb_erase(vma) // in __vma_unlink vma_gap_update(next) In the latter case, while unlinking "vma", validate_mm_rb() is told to ignore "vma" that is being removed, but next->vm_start was reduced instead. So for the new case, to avoid the false positive from validate_mm_rb, it should be "next" that is ignored when "vma" is being unlinked. "vma" and "next" in the above comment, considered pre-swap(). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrea Arcangeli <[email protected]> Tested-by: Shaun Tancheff <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Jan Vorlicek <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07mm: vma_adjust: minor comment correctionAndrea Arcangeli1-1/+1
The cases are three not two. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrea Arcangeli <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Jan Vorlicek <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07mm: vma_adjust: remove superfluous check for next not NULLAndrea Arcangeli1-1/+1
If next would be NULL we couldn't reach such code path. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrea Arcangeli <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Jan Vorlicek <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07mm: vma_merge: fix vm_page_prot SMP race condition against rmap_walkAndrea Arcangeli3-29/+139
The rmap_walk can access vm_page_prot (and potentially vm_flags in the pte/pmd manipulations). So it's not safe to wait the caller to update the vm_page_prot/vm_flags after vma_merge returned potentially removing the "next" vma and extending the "current" vma over the next->vm_start,vm_end range, but still with the "current" vma vm_page_prot, after releasing the rmap locks. The vm_page_prot/vm_flags must be transferred from the "next" vma to the current vma while vma_merge still holds the rmap locks. The side effect of this race condition is pte corruption during migrate as remove_migration_ptes when run on a address of the "next" vma that got removed, used the vm_page_prot of the current vma. migrate mprotect ------------ ------------- migrating in "next" vma vma_merge() # removes "next" vma and # extends "current" vma # current vma is not with # vm_page_prot updated remove_migration_ptes read vm_page_prot of current "vma" establish pte with wrong permissions vm_set_page_prot(vma) # too late! change_protection in the old vma range only, next range is not updated This caused segmentation faults and potentially memory corruption in heavy mprotect loads with some light page migration caused by compaction in the background. Hugh Dickins pointed out the comment about the Odd case 8 in vma_merge which confirms the case 8 is only buggy one where the race can trigger, in all other vma_merge cases the above cannot happen. This fix removes the oddness factor from case 8 and it converts it from: AAAA PPPPNNNNXXXX -> PPPPNNNNNNNN to: AAAA PPPPNNNNXXXX -> PPPPXXXXXXXX XXXX has the right vma properties for the whole merged vma returned by vma_adjust, so it solves the problem fully. It has the added benefits that the callers could stop updating vma properties when vma_merge succeeds however the callers are not updated by this patch (there are bits like VM_SOFTDIRTY that still need special care for the whole range, as the vma merging ignores them, but as long as they're not processed by rmap walks and instead they're accessed with the mmap_sem at least for reading, they are fine not to be updated within vma_adjust before releasing the rmap_locks). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrea Arcangeli <[email protected]> Reported-by: Aditya Mandaleeka <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Jan Vorlicek <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07mm: vma_adjust: remove superfluous confusing update in remove_next == 1 caseAndrea Arcangeli1-2/+22
mm->highest_vm_end doesn't need any update. After finally removing the oddness from vma_merge case 8 that was causing: 1) constant risk of trouble whenever anybody would check vma fields from rmap_walks, like it happened when page migration was introduced and it read the vma->vm_page_prot from a rmap_walk 2) the callers of vma_merge to re-initialize any value different from the current vma, instead of vma_merge() more reliably returning a vma that already matches all fields passed as parameter .. it is also worth to take the opportunity of cleaning up superfluous code in vma_adjust(), that if not removed adds up to the hard readability of the function. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrea Arcangeli <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Jan Vorlicek <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07mm: vm_page_prot: update with WRITE_ONCE/READ_ONCEAndrea Arcangeli5-11/+13
vma->vm_page_prot is read lockless from the rmap_walk, it may be updated concurrently and this prevents the risk of reading intermediate values. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrea Arcangeli <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Jan Vorlicek <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07mm,ksm: add __GFP_HIGH to the allocation in alloc_stable_node()zhong jiang1-1/+6
According to Hugh's suggestion, alloc_stable_node() with GFP_KERNEL can in rare cases cause a hung task warning. At present, if alloc_stable_node() allocation fails, two break_cows may want to allocate a couple of pages, and the issue will come up when free memory is under pressure. We fix it by adding __GFP_HIGH to GFP, to grant access to memory reserves, increasing the likelihood of allocation success. [[email protected]: tweak comment] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: zhong jiang <[email protected]> Suggested-by: Hugh Dickins <[email protected]> Acked-by: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07mm/page_isolation: fix typo: "paes" -> "pages"Yisheng Xie1-1/+1
Fix typo in comment. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yisheng Xie <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07mm/hugetlb: improve locking in dissolve_free_huge_pages()Gerald Schaefer1-3/+9
For every pfn aligned to minimum_order, dissolve_free_huge_pages() will call dissolve_free_huge_page() which takes the hugetlb spinlock, even if the page is not huge at all or a hugepage that is in-use. Improve this by doing the PageHuge() and page_count() checks already in dissolve_free_huge_pages() before calling dissolve_free_huge_page(). In dissolve_free_huge_page(), when holding the spinlock, those checks need to be revalidated. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Gerald Schaefer <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: "Aneesh Kumar K . V" <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Rui Teng <[email protected]> Cc: Dave Hansen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>