aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2015-09-10checkpatch: emit an error on formats with 0x%<decimal>Joe Perches1-2/+6
Using 0x%d is wrong. Emit a message when it happens. Miscellanea: Improve the %Lu warning to match formats like %16Lu. Signed-off-by: Joe Perches <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10checkpatch: make --strict the default for drivers/staging files and patchesJoe Perches1-1/+1
Making --strict the default for staging may help some people submit patches without obvious defects. Signed-off-by: Joe Perches <[email protected]> Cc: Dan Carpenter <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10checkpatch: always check block comment stylesJoe Perches1-8/+11
Some of the block comment tests that are used only for networking are appropriate for all patches. For example, these styles are not encouraged: /* block comment without introductory * */ and /* * block comment with line terminating */ Remove the networking specific test and add comments. There are some infrequent false positives where code is lazily commented out using /* and */ rather than using #if 0/#endif blocks like: /* case foo: case bar: */ case baz: Signed-off-by: Joe Perches <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10checkpatch: report the right line # when using --emacs and --fileJoe Perches1-1/+5
commit 34d8815f9512 ("checkpatch: add --showfile to allow input via pipe to show filenames") broke the --emacs with --file option. Fix it. Signed-off-by: Joe Perches <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10checkpatch: add some <foo>_destroy functions to NEEDLESS_IF testsJoe Perches1-4/+28
Sergey Senozhatsky has modified several destroy functions that can now be called with NULL values. - kmem_cache_destroy() - mempool_destroy() - dma_pool_destroy() Update checkpatch to warn when those functions are preceded by an if. Update checkpatch to --fix all the calls too only when the code style form is using leading tabs. from: if (foo) <func>(foo); to: <func>(foo); Signed-off-by: Joe Perches <[email protected]> Tested-by: Sergey Senozhatsky <[email protected]> Cc: David Rientjes <[email protected]> Cc: Julia Lawall <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10checkpatch: Allow longer declaration macrosJoe Perches1-1/+1
Some really long declaration macros exist. For instance; DEFINE_DMA_BUF_EXPORT_INFO(exp_info); and DECLARE_DM_KCOPYD_THROTTLE_WITH_MODULE_PARM(name, description) Increase the limit from 2 words to 6 after DECLARE/DEFINE uses. Signed-off-by: Joe Perches <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10checkpatch: improve SUSPECT_CODE_INDENT testJoe Perches1-6/+15
Many lines exist like if (foo) bar; where the tabbed indentation of the branch is not one more than the "if" line above it. checkpatch should emit a warning on those lines. Miscellenea: o Remove comments from branch blocks o Skip blank lines in block Signed-off-by: Joe Perches <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10checkpatch: add warning on BUG/BUG_ON useJoe Perches1-6/+8
Using BUG/BUG_ON crashes the kernel and is just unfriendly. Enable code that emits a warning on BUG/BUG_ON use. Make the code emit the message at WARNING level when scanning a patch and at CHECK level when scanning files so that script users don't feel an obligation to fix code that might be above their pay grade. Signed-off-by: Joe Perches <[email protected]> Reported-by: Geert Uytterhoeven <[email protected]> Tested-by: Geert Uytterhoeven <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10checkpatch: warn on bare SHA-1 commit IDs in commit logsJoe Perches1-3/+12
Commit IDs should have commit descriptions too. Warn when a 12 to 40 byte SHA-1 is used in commit logs. Signed-off-by: Joe Perches <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10lib/test_kasan.c: make kmalloc_oob_krealloc_less more correctlyWang Long1-1/+1
In kmalloc_oob_krealloc_less, I think it is better to test the size2 boundary. If we do not call krealloc, the access of position size1 will still cause out-of-bounds and access of position size2 does not. After call krealloc, the access of position size2 cause out-of-bounds. So using size2 is more correct. Signed-off-by: Wang Long <[email protected]> Cc: Andrey Ryabinin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10lib/test_kasan.c: fix a typoWang Long1-2/+2
Signed-off-by: Wang Long <[email protected]> Cc: Andrey Ryabinin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10lib/string_helpers: rename "esc" arg to "only"Kees Cook2-14/+14
To further clarify the purpose of the "esc" argument, rename it to "only" to reflect that it is a limit, not a list of additional characters to escape. Signed-off-by: Kees Cook <[email protected]> Suggested-by: Rasmus Villemoes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10lib/string_helpers: clarify esc arg in string_escape_memKees Cook1-4/+6
The esc argument is used to reduce which characters will be escaped. For example, using " " with ESCAPE_SPACE will not produce any escaped spaces. Signed-off-by: Kees Cook <[email protected]> Cc: Andy Shevchenko <[email protected]> Cc: Rasmus Villemoes <[email protected]> Cc: Mathias Krause <[email protected]> Cc: James Bottomley <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10hexdump: do not print debug dumps for !CONFIG_DEBUGLinus Walleij1-2/+8
print_hex_dump_debug() is likely supposed to be analogous to pr_debug() or dev_dbg() & friends. Currently it will adhere to dynamic debug, but will not stub out prints if CONFIG_DEBUG is not set. Let's make it do the right thing, because I am tired of having my dmesg buffer full of hex dumps on production systems. Signed-off-by: Linus Walleij <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10lib/bitmap.c: bitmap_parselist can accept string with whitespaces on head or ↵Pan Xinhui1-14/+18
tail In __bitmap_parselist we can accept whitespaces on head or tail during every parsing procedure. If input has valid ranges, there is no reason to reject the user. For example, bitmap_parselist(" 1-3, 5, ", &mask, nmaskbits). After separating the string, we get " 1-3", " 5", and " ". It's possible and reasonable to accept such string as long as the parsing result is correct. Signed-off-by: Pan Xinhui <[email protected]> Cc: Yury Norov <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Rasmus Villemoes <[email protected]> Cc: Sudeep Holla <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10lib/bitmap.c: fix a special string handling bug in __bitmap_parselistPan Xinhui1-0/+4
If string end with '-', for exapmle, bitmap_parselist("1,0-",&mask, nmaskbits), It is not in a valid pattern, so add a check after loop. Return -EINVAL on such condition. Signed-off-by: Pan Xinhui <[email protected]> Cc: Yury Norov <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Rasmus Villemoes <[email protected]> Cc: Sudeep Holla <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10lib/bitmap.c: correct a code style and do some, optimizationPan Xinhui1-3/+4
We can avoid in-loop incrementation of ndigits. Save current totaldigits to ndigits before loop, and check ndigits against totaldigits after the loop. Signed-off-by: Pan Xinhui <[email protected]> Cc: Yury Norov <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Rasmus Villemoes <[email protected]> Cc: Sudeep Holla <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10proc: convert to kstrto*()/kstrto*_from_user()Alexey Dobriyan1-49/+21
Convert from manual allocation/copy_from_user/... to kstrto*() family which were designed for exactly that. One case can not be converted to kstrto*_from_user() to make code even more simpler because of whitespace stripping, oh well... Signed-off-by: Alexey Dobriyan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10kstrto*: accept "-0" for signed conversionAlexey Dobriyan2-6/+2
strtol(3) et al accept "-0", so should we. Signed-off-by: Alexey Dobriyan <[email protected]> Cc: David Howells <[email protected]> Cc: Jan Kara <[email protected]> Cc: Joel Becker <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Rasmus Villemoes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10MAINTAINERS/CREDITS: mark MaxRAID as Orphan, move Anil Ravindranath to CREDITSJoe Perches2-2/+5
Anil's email address bounces and he hasn't had a signoff in over 5 years. Signed-off-by: Joe Perches <[email protected]> Cc: James Bottomley <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10include/linux/printk.h: include pr_fmt in pr_debug_ratelimitedJason A. Donenfeld1-2/+2
The other two implementations of pr_debug_ratelimited include pr_fmt, along with every other pr_* function. But pr_debug_ratelimited forgot to add it with the CONFIG_DYNAMIC_DEBUG implementation. This patch unifies the behavior. Signed-off-by: Jason A. Donenfeld <[email protected]> Cc: Steven Rostedt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10kernel/cred.c: remove unnecessary kdebug atomic readsJoe Perches1-4/+9
Commit e0e817392b9a ("CRED: Add some configurable debugging [try #6]") added the kdebug mechanism to this file back in 2009. The kdebug macro calls no_printk which always evaluates arguments. Most of the kdebug uses have an unnecessary call of atomic_read(&cred->usage) Make the kdebug macro do nothing by defining it with do { if (0) no_printk(...); } while (0) when not enabled. $ size kernel/cred.o* (defconfig x86-64) text data bss dec hex filename 2748 336 8 3092 c14 kernel/cred.o.new 2788 336 8 3132 c3c kernel/cred.o.old Miscellanea: o Neaten the #define kdebug macros while there Signed-off-by: Joe Perches <[email protected]> Cc: David Howells <[email protected]> Cc: James Morris <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10kernel/extable.c: remove duplicated includeWei Yongjun1-1/+0
Signed-off-by: Wei Yongjun <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10include/linux/poison.h: remove not-used poison pointer macrosVasily Kulikov1-7/+0
Signed-off-by: Vasily Kulikov <[email protected]> Cc: Solar Designer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10include/linux/poison.h: fix LIST_POISON{1,2} offsetVasily Kulikov1-2/+2
Poison pointer values should be small enough to find a room in non-mmap'able/hardly-mmap'able space. E.g. on x86 "poison pointer space" is located starting from 0x0. Given unprivileged users cannot mmap anything below mmap_min_addr, it should be safe to use poison pointers lower than mmap_min_addr. The current poison pointer values of LIST_POISON{1,2} might be too big for mmap_min_addr values equal or less than 1 MB (common case, e.g. Ubuntu uses only 0x10000). There is little point to use such a big value given the "poison pointer space" below 1 MB is not yet exhausted. Changing it to a smaller value solves the problem for small mmap_min_addr setups. The values are suggested by Solar Designer: http://www.openwall.com/lists/oss-security/2015/05/02/6 Signed-off-by: Vasily Kulikov <[email protected]> Cc: Solar Designer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10proc: change proc_subdir_lock to a rwlockWaiman Long1-22/+22
The proc_subdir_lock spinlock is used to allow only one task to make change to the proc directory structure as well as looking up information in it. However, the information lookup part can actually be entered by more than one task as the pde_get() and pde_put() reference count update calls in the critical sections are atomic increment and decrement respectively and so are safe with concurrent updates. The x86 architecture has already used qrwlock which is fair and other architectures like ARM are in the process of switching to qrwlock. So unfairness shouldn't be a concern in that conversion. This patch changed the proc_subdir_lock to a rwlock in order to enable concurrent lookup. The following functions were modified to take a write lock: - proc_register() - remove_proc_entry() - remove_proc_subtree() The following functions were modified to take a read lock: - xlate_proc_name() - proc_lookup_de() - proc_readdir_de() A parallel /proc filesystem search with the "find" command (1000 threads) was run on a 4-socket Haswell-EX box (144 threads). Before the patch, the parallel search took about 39s. After the patch, the parallel find took only 25s, a saving of about 14s. The micro-benchmark that I used was artificial, but it was used to reproduce an exit hanging problem that I saw in real application. In fact, only allow one task to do a lookup seems too limiting to me. Signed-off-by: Waiman Long <[email protected]> Acked-by: "Eric W. Biederman" <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Nicolas Dichtel <[email protected]> Cc: Al Viro <[email protected]> Cc: Scott J Norton <[email protected]> Cc: Douglas Hatch <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10procfs: always expose /proc/<pid>/map_files/ and make it readableCalvin Owens1-19/+24
Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and is only exposed if CONFIG_CHECKPOINT_RESTORE is set. Each mapped file region gets a symlink in /proc/<pid>/map_files/ corresponding to the virtual address range at which it is mapped. The symlinks work like the symlinks in /proc/<pid>/fd/, so you can follow them to the backing file even if that backing file has been unlinked. Currently, files which are mapped, unlinked, and closed are impossible to stat() from userspace. Exposing /proc/<pid>/map_files/ closes this functionality "hole". Not being able to stat() such files makes noticing and explicitly accounting for the space they use on the filesystem impossible. You can work around this by summing up the space used by every file in the filesystem and subtracting that total from what statfs() tells you, but that obviously isn't great, and it becomes unworkable once your filesystem becomes large enough. This patch moves map_files/ out from behind CONFIG_CHECKPOINT_RESTORE, and adjusts the permissions enforced on it as follows: * proc_map_files_lookup() * proc_map_files_readdir() * map_files_d_revalidate() Remove the CAP_SYS_ADMIN restriction, leaving only the current restriction requiring PTRACE_MODE_READ. The information made available to userspace by these three functions is already available in /proc/PID/maps with MODE_READ, so I don't see any reason to limit them any further (see below for more detail). * proc_map_files_follow_link() This stub has been added, and requires that the user have CAP_SYS_ADMIN in order to follow the links in map_files/, since there was concern on LKML both about the potential for bypassing permissions on ancestor directories in the path to files pointed to, and about what happens with more exotic memory mappings created by some drivers (ie dma-buf). In older versions of this patch, I changed every permission check in the four functions above to enforce MODE_ATTACH instead of MODE_READ. This was an oversight on my part, and after revisiting the discussion it seems that nobody was concerned about anything outside of what is made possible by ->follow_link(). So in this version, I've left the checks for PTRACE_MODE_READ as-is. [[email protected]: catch up with concurrent proc_pid_follow_link() changes] Signed-off-by: Calvin Owens <[email protected]> Reviewed-by: Kees Cook <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Joe Perches <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10proc: add cond_resched to /proc/kpage* read/write loopVladimir Davydov1-0/+6
Reading/writing a /proc/kpage* file may take long on machines with a lot of RAM installed. Signed-off-by: Vladimir Davydov <[email protected]> Suggested-by: Andres Lagar-Cavilla <[email protected]> Reviewed-by: Andres Lagar-Cavilla <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Raghavendra K T <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: David Rientjes <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10proc: export idle flag via kpageflagsVladimir Davydov3-0/+11
As noted by Minchan, a benefit of reading idle flag from /proc/kpageflags is that one can easily filter dirty and/or unevictable pages while estimating the size of unused memory. Note that idle flag read from /proc/kpageflags may be stale in case the page was accessed via a PTE, because it would be too costly to iterate over all page mappings on each /proc/kpageflags read to provide an up-to-date value. To make sure the flag is up-to-date one has to read /sys/kernel/mm/page_idle/bitmap first. Signed-off-by: Vladimir Davydov <[email protected]> Reviewed-by: Andres Lagar-Cavilla <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Raghavendra K T <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: David Rientjes <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10mm: introduce idle page trackingVladimir Davydov17-3/+512
Knowing the portion of memory that is not used by a certain application or memory cgroup (idle memory) can be useful for partitioning the system efficiently, e.g. by setting memory cgroup limits appropriately. Currently, the only means to estimate the amount of idle memory provided by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the access bit for all pages mapped to a particular process by writing 1 to clear_refs, wait for some time, and then count smaps:Referenced. However, this method has two serious shortcomings: - it does not count unmapped file pages - it affects the reclaimer logic To overcome these drawbacks, this patch introduces two new page flags, Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap. A page's Idle flag can only be set from userspace by setting bit in /sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page, and it is cleared whenever the page is accessed either through page tables (it is cleared in page_referenced() in this case) or using the read(2) system call (mark_page_accessed()). Thus by setting the Idle flag for pages of a particular workload, which can be found e.g. by reading /proc/PID/pagemap, waiting for some time to let the workload access its working set, and then reading the bitmap file, one can estimate the amount of pages that are not used by the workload. The Young page flag is used to avoid interference with the memory reclaimer. A page's Young flag is set whenever the Access bit of a page table entry pointing to the page is cleared by writing to the bitmap file. If page_referenced() is called on a Young page, it will add 1 to its return value, therefore concealing the fact that the Access bit was cleared. Note, since there is no room for extra page flags on 32 bit, this feature uses extended page flags when compiled on 32 bit. [[email protected]: fix build] [[email protected]: kpageidle requires an MMU] [[email protected]: decouple from page-flags rework] Signed-off-by: Vladimir Davydov <[email protected]> Reviewed-by: Andres Lagar-Cavilla <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Raghavendra K T <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: David Rientjes <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10mmu-notifier: add clear_young callbackVladimir Davydov3-0/+92
In the scope of the idle memory tracking feature, which is introduced by the following patch, we need to clear the referenced/accessed bit not only in primary, but also in secondary ptes. The latter is required in order to estimate wss of KVM VMs. At the same time we want to avoid flushing tlb, because it is quite expensive and it won't really affect the final result. Currently, there is no function for clearing pte young bit that would meet our requirements, so this patch introduces one. To achieve that we have to add a new mmu-notifier callback, clear_young, since there is no method for testing-and-clearing a secondary pte w/o flushing tlb. The new method is not mandatory and currently only implemented by KVM. Signed-off-by: Vladimir Davydov <[email protected]> Reviewed-by: Andres Lagar-Cavilla <[email protected]> Acked-by: Paolo Bonzini <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Raghavendra K T <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: David Rientjes <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10proc: add kpagecgroup fileVladimir Davydov2-1/+58
/proc/kpagecgroup contains a 64-bit inode number of the memory cgroup each page is charged to, indexed by PFN. Having this information is useful for estimating a cgroup working set size. The file is present if CONFIG_PROC_PAGE_MONITOR && CONFIG_MEMCG. Signed-off-by: Vladimir Davydov <[email protected]> Reviewed-by: Andres Lagar-Cavilla <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Raghavendra K T <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: David Rientjes <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10memcg: zap try_get_mem_cgroup_from_pageVladimir Davydov2-44/+13
It is only used in mem_cgroup_try_charge, so fold it in and zap it. Signed-off-by: Vladimir Davydov <[email protected]> Reviewed-by: Andres Lagar-Cavilla <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Raghavendra K T <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: David Rientjes <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10hwpoison: use page_cgroup_ino for filtering by memcgVladimir Davydov2-18/+3
Hwpoison allows to filter pages by memory cgroup ino. Currently, it calls try_get_mem_cgroup_from_page to obtain the cgroup from a page and then its ino using cgroup_ino, but now we have a helper method for that, page_cgroup_ino, so use it instead. This patch also loosens the hwpoison memcg filter dependency rules - it makes it depend on CONFIG_MEMCG instead of CONFIG_MEMCG_SWAP, because hwpoison memcg filter does not require anything (nor it used to) from CONFIG_MEMCG_SWAP side. Signed-off-by: Vladimir Davydov <[email protected]> Reviewed-by: Andres Lagar-Cavilla <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Raghavendra K T <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: David Rientjes <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10memcg: add page_cgroup_ino helperVladimir Davydov2-0/+29
This patchset introduces a new user API for tracking user memory pages that have not been used for a given period of time. The purpose of this is to provide the userspace with the means of tracking a workload's working set, i.e. the set of pages that are actively used by the workload. Knowing the working set size can be useful for partitioning the system more efficiently, e.g. by tuning memory cgroup limits appropriately, or for job placement within a compute cluster. ==== USE CASES ==== The unified cgroup hierarchy has memory.low and memory.high knobs, which are defined as the low and high boundaries for the workload working set size. However, the working set size of a workload may be unknown or change in time. With this patch set, one can periodically estimate the amount of memory unused by each cgroup and tune their memory.low and memory.high parameters accordingly, therefore optimizing the overall memory utilization. Another use case is balancing workloads within a compute cluster. Knowing how much memory is not really used by a workload unit may help take a more optimal decision when considering migrating the unit to another node within the cluster. Also, as noted by Minchan, this would be useful for per-process reclaim (https://lwn.net/Articles/545668/). With idle tracking, we could reclaim idle pages only by smart user memory manager. ==== USER API ==== The user API consists of two new files: * /sys/kernel/mm/page_idle/bitmap. This file implements a bitmap where each bit corresponds to a page, indexed by PFN. When the bit is set, the corresponding page is idle. A page is considered idle if it has not been accessed since it was marked idle. To mark a page idle one should set the bit corresponding to the page by writing to the file. A value written to the file is OR-ed with the current bitmap value. Only user memory pages can be marked idle, for other page types input is silently ignored. Writing to this file beyond max PFN results in the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is set. This file can be used to estimate the amount of pages that are not used by a particular workload as follows: 1. mark all pages of interest idle by setting corresponding bits in the /sys/kernel/mm/page_idle/bitmap 2. wait until the workload accesses its working set 3. read /sys/kernel/mm/page_idle/bitmap and count the number of bits set * /proc/kpagecgroup. This file contains a 64-bit inode number of the memory cgroup each page is charged to, indexed by PFN. Only available when CONFIG_MEMCG is set. This file can be used to find all pages (including unmapped file pages) accounted to a particular cgroup. Using /sys/kernel/mm/page_idle/bitmap, one can then estimate the cgroup working set size. For an example of using these files for estimating the amount of unused memory pages per each memory cgroup, please see the script attached below. ==== REASONING ==== The reason to introduce the new user API instead of using /proc/PID/{clear_refs,smaps} is that the latter has two serious drawbacks: - it does not count unmapped file pages - it affects the reclaimer logic The new API attempts to overcome them both. For more details on how it is achieved, please see the comment to patch 6. ==== PATCHSET STRUCTURE ==== The patch set is organized as follows: - patch 1 adds page_cgroup_ino() helper for the sake of /proc/kpagecgroup and patches 2-3 do related cleanup - patch 4 adds /proc/kpagecgroup, which reports cgroup ino each page is charged to - patch 5 introduces a new mmu notifier callback, clear_young, which is a lightweight version of clear_flush_young; it is used in patch 6 - patch 6 implements the idle page tracking feature, including the userspace API, /sys/kernel/mm/page_idle/bitmap - patch 7 exports idle flag via /proc/kpageflags ==== SIMILAR WORKS ==== Originally, the patch for tracking idle memory was proposed back in 2011 by Michel Lespinasse (see http://lwn.net/Articles/459269/). The main difference between Michel's patch and this one is that Michel implemented a kernel space daemon for estimating idle memory size per cgroup while this patch only provides the userspace with the minimal API for doing the job, leaving the rest up to the userspace. However, they both share the same idea of Idle/Young page flags to avoid affecting the reclaimer logic. ==== PERFORMANCE EVALUATION ==== SPECjvm2008 (https://www.spec.org/jvm2008/) was used to evaluate the performance impact introduced by this patch set. Three runs were carried out: - base: kernel without the patch - patched: patched kernel, the feature is not used - patched-active: patched kernel, 1 minute-period daemon is used for tracking idle memory For tracking idle memory, idlememstat utility was used: https://github.com/locker/idlememstat testcase base patched patched-active compiler 537.40 ( 0.00)% 532.26 (-0.96)% 538.31 ( 0.17)% compress 305.47 ( 0.00)% 301.08 (-1.44)% 300.71 (-1.56)% crypto 284.32 ( 0.00)% 282.21 (-0.74)% 284.87 ( 0.19)% derby 411.05 ( 0.00)% 413.44 ( 0.58)% 412.07 ( 0.25)% mpegaudio 189.96 ( 0.00)% 190.87 ( 0.48)% 189.42 (-0.28)% scimark.large 46.85 ( 0.00)% 46.41 (-0.94)% 47.83 ( 2.09)% scimark.small 412.91 ( 0.00)% 415.41 ( 0.61)% 421.17 ( 2.00)% serial 204.23 ( 0.00)% 213.46 ( 4.52)% 203.17 (-0.52)% startup 36.76 ( 0.00)% 35.49 (-3.45)% 35.64 (-3.05)% sunflow 115.34 ( 0.00)% 115.08 (-0.23)% 117.37 ( 1.76)% xml 620.55 ( 0.00)% 619.95 (-0.10)% 620.39 (-0.03)% composite 211.50 ( 0.00)% 211.15 (-0.17)% 211.67 ( 0.08)% time idlememstat: 17.20user 65.16system 2:15:23elapsed 1%CPU (0avgtext+0avgdata 8476maxresident)k 448inputs+40outputs (1major+36052minor)pagefaults 0swaps ==== SCRIPT FOR COUNTING IDLE PAGES PER CGROUP ==== #! /usr/bin/python # import os import stat import errno import struct CGROUP_MOUNT = "/sys/fs/cgroup/memory" BUFSIZE = 8 * 1024 # must be multiple of 8 def get_hugepage_size(): with open("/proc/meminfo", "r") as f: for s in f: k, v = s.split(":") if k == "Hugepagesize": return int(v.split()[0]) * 1024 PAGE_SIZE = os.sysconf("SC_PAGE_SIZE") HUGEPAGE_SIZE = get_hugepage_size() def set_idle(): f = open("/sys/kernel/mm/page_idle/bitmap", "wb", BUFSIZE) while True: try: f.write(struct.pack("Q", pow(2, 64) - 1)) except IOError as err: if err.errno == errno.ENXIO: break raise f.close() def count_idle(): f_flags = open("/proc/kpageflags", "rb", BUFSIZE) f_cgroup = open("/proc/kpagecgroup", "rb", BUFSIZE) with open("/sys/kernel/mm/page_idle/bitmap", "rb", BUFSIZE) as f: while f.read(BUFSIZE): pass # update idle flag idlememsz = {} while True: s1, s2 = f_flags.read(8), f_cgroup.read(8) if not s1 or not s2: break flags, = struct.unpack('Q', s1) cgino, = struct.unpack('Q', s2) unevictable = (flags >> 18) & 1 huge = (flags >> 22) & 1 idle = (flags >> 25) & 1 if idle and not unevictable: idlememsz[cgino] = idlememsz.get(cgino, 0) + \ (HUGEPAGE_SIZE if huge else PAGE_SIZE) f_flags.close() f_cgroup.close() return idlememsz if __name__ == "__main__": print "Setting the idle flag for each page..." set_idle() raw_input("Wait until the workload accesses its working set, " "then press Enter") print "Counting idle pages..." idlememsz = count_idle() for dir, subdirs, files in os.walk(CGROUP_MOUNT): ino = os.stat(dir)[stat.ST_INO] print dir + ": " + str(idlememsz.get(ino, 0) / 1024) + " kB" ==== END SCRIPT ==== This patch (of 8): Add page_cgroup_ino() helper to memcg. This function returns the inode number of the closest online ancestor of the memory cgroup a page is charged to. It is required for exporting information about which page is charged to which cgroup to userspace, which will be introduced by a following patch. Signed-off-by: Vladimir Davydov <[email protected]> Reviewed-by: Andres Lagar-Cavilla <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Raghavendra K T <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: David Rientjes <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10zswap: update docs for runtime-changeable attributesDan Streetman1-8/+28
Change the Documentation/vm/zswap.txt doc to indicate that the "zpool" and "compressor" params are now changeable at runtime. Signed-off-by: Dan Streetman <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10zswap: change zpool/compressor at runtimeDan Streetman1-13/+129
Update the zpool and compressor parameters to be changeable at runtime. When changed, a new pool is created with the requested zpool/compressor, and added as the current pool at the front of the pool list. Previous pools remain in the list only to remove existing compressed pages from. The old pool(s) are removed once they become empty. Signed-off-by: Dan Streetman <[email protected]> Acked-by: Seth Jennings <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10zswap: dynamic pool creationDan Streetman1-143/+405
Add dynamic creation of pools. Move the static crypto compression per-cpu transforms into each pool. Add a pointer to zswap_entry to the pool it's in. This is required by the following patch which enables changing the zswap zpool and compressor params at runtime. [[email protected]: fix merge snafus] Signed-off-by: Dan Streetman <[email protected]> Acked-by: Seth Jennings <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10zpool: add zpool_has_pool()Dan Streetman2-0/+35
This series makes creation of the zpool and compressor dynamic, so that they can be changed at runtime. This makes using/configuring zswap easier, as before this zswap had to be configured at boot time, using boot params. This uses a single list to track both the zpool and compressor together, although Seth had mentioned an alternative which is to track the zpools and compressors using separate lists. In the most common case, only a single zpool and single compressor, using one list is slightly simpler than using two lists, and for the uncommon case of multiple zpools and/or compressors, using one list is slightly less simple (and uses slightly more memory, probably) than using two lists. This patch (of 4): Add zpool_has_pool() function, indicating if the specified type of zpool is available (i.e. zsmalloc or zbud). This allows checking if a pool is available, without actually trying to allocate it, similar to crypto_has_alg(). This is used by a following patch to zswap that enables the dynamic runtime creation of zswap zpools. Signed-off-by: Dan Streetman <[email protected]> Acked-by: Seth Jennings <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10tcp_cubic: better follow cubic curve after idle periodEric Dumazet1-0/+16
Jana Iyengar found an interesting issue on CUBIC : The epoch is only updated/reset initially and when experiencing losses. The delta "t" of now - epoch_start can be arbitrary large after app idle as well as the bic_target. Consequentially the slope (inverse of ca->cnt) would be really large, and eventually ca->cnt would be lower-bounded in the end to 2 to have delayed-ACK slow-start behavior. This particularly shows up when slow_start_after_idle is disabled as a dangerous cwnd inflation (1.5 x RTT) after few seconds of idle time. Jana initial fix was to reset epoch_start if app limited, but Neal pointed out it would ask the CUBIC algorithm to recalculate the curve so that we again start growing steeply upward from where cwnd is now (as CUBIC does just after a loss). Ideally we'd want the cwnd growth curve to be the same shape, just shifted later in time by the amount of the idle period. Reported-by: Jana Iyengar <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: Yuchung Cheng <[email protected]> Signed-off-by: Neal Cardwell <[email protected]> Cc: Stephen Hemminger <[email protected]> Cc: Sangtae Ha <[email protected]> Cc: Lawrence Brakmo <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-09-10tcp: generate CA_EVENT_TX_START on data framesNeal Cardwell1-3/+3
Issuing a CC TX_START event on control frames like pure ACK is a waste of time, as a CC should not care. Following patch needs this change, as we want CUBIC to properly track idle time at a low cost, with a single TX_START being generated. Yuchung might slightly refine the condition triggering TX_START on a followup patch. Signed-off-by: Neal Cardwell <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: Yuchung Cheng <[email protected]> Cc: Jana Iyengar <[email protected]> Cc: Stephen Hemminger <[email protected]> Cc: Sangtae Ha <[email protected]> Cc: Lawrence Brakmo <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-09-10xen-netfront: respect user provided max_queuesWei Liu1-2/+5
Originally that parameter was always reset to num_online_cpus during module initialisation, which renders it useless. The fix is to only set max_queues to num_online_cpus when user has not provided a value. Signed-off-by: Wei Liu <[email protected]> Cc: David Vrabel <[email protected]> Reviewed-by: David Vrabel <[email protected]> Tested-by: David Vrabel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-09-10xen-netback: respect user provided max_queuesWei Liu1-2/+5
Originally that parameter was always reset to num_online_cpus during module initialisation, which renders it useless. The fix is to only set max_queues to num_online_cpus when user has not provided a value. Reported-by: Johnny Strom <[email protected]> Signed-off-by: Wei Liu <[email protected]> Reviewed-by: David Vrabel <[email protected]> Acked-by: Ian Campbell <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-09-10r8169: Fix sleeping function called during get_stats64, v2Corinna Vinschen1-83/+54
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=104031 Fixes: 6e85d5ad36a26debc23a9a865c029cbe242b2dc8 Based on the discussion starting at http://www.spinics.net/lists/netdev/msg342193.html Tested locally on RTL8168evl/8111evl with various concurrent processes accessing /proc/net/dev while changing the link state as well as removing/reloading the r8169 module. Signed-off-by: Corinna Vinschen <[email protected]> Tested-by: poma <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-09-10drm/i915: Allow DSI dual link to be configured on any pipeGaurav K Singh1-5/+4
Just like single link MIPI panels, similarly for dual link panels, pipe to be configured is based on the DVO port from VBT Block 2. In hardware, Port A is mapped with Pipe A and Port C is mapped with Pipe B. This issue got introduced in - commit 7e9804fdcffc650515c60f524b8b2076ee59e710 Author: Jani Nikula <[email protected]> Date: Fri Jan 16 14:27:23 2015 +0200 drm/i915/dsi: add drm mipi dsi host support Cc: [email protected] # v4.0 Signed-off-by: Gaurav K Singh <[email protected]> Reviewed-by: Ville Syrjälä <[email protected]> Signed-off-by: Jani Nikula <[email protected]>
2015-09-10drm/i915: Don't try to use DDR DVFS on CHV when disabled in the BIOSVille Syrjälä2-13/+31
If one disables DDR DVFS in the BIOS, Punit will apparently ignores all DDR DVFS request. Currently we assume that DDR DVFS is always operational, which leads to errors in dmesg when the DDR DVFS requests time out. Fix the problem by gently prodding Punit during driver load to find out whether it will respond to DDR DVFS requests. If the request times out, we assume that DDR DVFS has been permanenly disabled in the BIOS and no longer perster the Punit about it. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=91629 Signed-off-by: Ville Syrjälä <[email protected]> Reviewed-by: Clint Taylor <[email protected]> Tested-by: Clint Taylor <[email protected]> Signed-off-by: Jani Nikula <[email protected]>
2015-09-10drm/i915: Fix CSR MMIO address checkTakashi Iwai1-1/+1
Fix a wrong logical AND (&&) used for the range check of CSR MMIO. Spotted nicely by gcc -Wlogical-op flag: drivers/gpu/drm/i915/intel_csr.c: In function ‘finish_csr_load’: drivers/gpu/drm/i915/intel_csr.c:353:41: warning: logical ‘and’ of mutually exclusive tests is always false [-Wlogical-op] Fixes: eb805623d8b1 ('drm/i915/skl: Add support to load SKL CSR firmware.') Cc: <[email protected]> # v4.2 Signed-off-by: Takashi Iwai <[email protected]> Reviewed-by: Daniel Vetter <[email protected]> Reviewed-by: Animesh Manna <[email protected]> Signed-off-by: Jani Nikula <[email protected]>
2015-09-09ether: add IEEE 1722 ethertype - TSNHenrik Austad1-0/+1
IEEE 1722 describes AVB (later renamed to TSN - Time Sensitive Networking), a protocol, encapsualtion and synchronization to utilize standard networks for audio/video (and later other time-sensitive) streams. This standard uses ethertype 0x22F0. http://standards.ieee.org/develop/regauth/ethertype/eth.txt This is a respin of a previous patch ("ether: add AVB frame type ETH_P_AVB") CC: "David S. Miller" <[email protected]> CC: [email protected] CC: [email protected] CC: [email protected] Signed-off-by: Henrik Austad <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-09-10elf-em.h: move EM_MICROBLAZE to the common headerMike Frysinger2-1/+3
The linux/audit.h header uses EM_MICROBLAZE in order to define AUDIT_ARCH_MICROBLAZE, but it's only available in the microblaze asm headers. Move it to the common elf-em.h header so that the define can be used on non-microblaze systems. Otherwise we get build errors that EM_MICROBLAZE isn't defined when we try to use the AUDIT_ARCH_MICROBLAZE symbol. Signed-off-by: Mike Frysinger <[email protected]> Signed-off-by: Michal Simek <[email protected]>
2015-09-09netlink, mmap: fix edge-case leakages in nf queue zero-copyDaniel Borkmann3-10/+26
When netlink mmap on receive side is the consumer of nf queue data, it can happen that in some edge cases, we write skb shared info into the user space mmap buffer: Assume a possible rx ring frame size of only 4096, and the network skb, which is being zero-copied into the netlink skb, contains page frags with an overall skb->len larger than the linear part of the netlink skb. skb_zerocopy(), which is generic and thus not aware of the fact that shared info cannot be accessed for such skbs then tries to write and fill frags, thus leaking kernel data/pointers and in some corner cases possibly writing out of bounds of the mmap area (when filling the last slot in the ring buffer this way). I.e. the ring buffer slot is then of status NL_MMAP_STATUS_VALID, has an advertised length larger than 4096, where the linear part is visible at the slot beginning, and the leaked sizeof(struct skb_shared_info) has been written to the beginning of the next slot (also corrupting the struct nl_mmap_hdr slot header incl. status etc), since skb->end points to skb->data + ring->frame_size - NL_MMAP_HDRLEN. The fix adds and lets __netlink_alloc_skb() take the actual needed linear room for the network skb + meta data into account. It's completely irrelevant for non-mmaped netlink sockets, but in case mmap sockets are used, it can be decided whether the available skb_tailroom() is really large enough for the buffer, or whether it needs to internally fallback to a normal alloc_skb(). >From nf queue side, the information whether the destination port is an mmap RX ring is not really available without extra port-to-socket lookup, thus it can only be determined in lower layers i.e. when __netlink_alloc_skb() is called that checks internally for this. I chose to add the extra ldiff parameter as mmap will then still work: We have data_len and hlen in nfqnl_build_packet_message(), data_len is the full length (capped at queue->copy_range) for skb_zerocopy() and hlen some possible part of data_len that needs to be copied; the rem_len variable indicates the needed remaining linear mmap space. The only other workaround in nf queue internally would be after allocation time by f.e. cap'ing the data_len to the skb_tailroom() iff we deal with an mmap skb, but that would 1) expose the fact that we use a mmap skb to upper layers, and 2) trim the skb where we otherwise could just have moved the full skb into the normal receive queue. After the patch, in my test case the ring slot doesn't fit and therefore shows NL_MMAP_STATUS_COPY, where a full skb carries all the data and thus needs to be picked up via recv(). Fixes: 3ab1f683bf8b ("nfnetlink: add support for memory mapped netlink") Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>