aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2015-09-10fs: Don't dump core if the corefile would become world-readable.Jann Horn1-2/+6
On a filesystem like vfat, all files are created with the same owner and mode independent of who created the file. When a vfat filesystem is mounted with root as owner of all files and read access for everyone, root's processes left world-readable coredumps on it (but other users' processes only left empty corefiles when given write access because of the uid mismatch). Given that the old behavior was inconsistent and insecure, I don't see a problem with changing it. Now, all processes refuse to dump core unless the resulting corefile will only be readable by their owner. Signed-off-by: Jann Horn <[email protected]> Acked-by: Kees Cook <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10fs: if a coredump already exists, unlink and recreate with O_EXCLJann Horn1-6/+32
It was possible for an attacking user to trick root (or another user) into writing his coredumps into an attacker-readable, pre-existing file using rename() or link(), causing the disclosure of secret data from the victim process' virtual memory. Depending on the configuration, it was also possible to trick root into overwriting system files with coredumps. Fix that issue by never writing coredumps into existing files. Requirements for the attack: - The attack only applies if the victim's process has a nonzero RLIMIT_CORE and is dumpable. - The attacker can trick the victim into coredumping into an attacker-writable directory D, either because the core_pattern is relative and the victim's cwd is attacker-writable or because an absolute core_pattern pointing to a world-writable directory is used. - The attacker has one of these: A: on a system with protected_hardlinks=0: execute access to a folder containing a victim-owned, attacker-readable file on the same partition as D, and the victim-owned file will be deleted before the main part of the attack takes place. (In practice, there are lots of files that fulfill this condition, e.g. entries in Debian's /var/lib/dpkg/info/.) This does not apply to most Linux systems because most distros set protected_hardlinks=1. B: on a system with protected_hardlinks=1: execute access to a folder containing a victim-owned, attacker-readable and attacker-writable file on the same partition as D, and the victim-owned file will be deleted before the main part of the attack takes place. (This seems to be uncommon.) C: on any system, independent of protected_hardlinks: write access to a non-sticky folder containing a victim-owned, attacker-readable file on the same partition as D (This seems to be uncommon.) The basic idea is that the attacker moves the victim-owned file to where he expects the victim process to dump its core. The victim process dumps its core into the existing file, and the attacker reads the coredump from it. If the attacker can't move the file because he does not have write access to the containing directory, he can instead link the file to a directory he controls, then wait for the original link to the file to be deleted (because the kernel checks that the link count of the corefile is 1). A less reliable variant that requires D to be non-sticky works with link() and does not require deletion of the original link: link() the file into D, but then unlink() it directly before the kernel performs the link count check. On systems with protected_hardlinks=0, this variant allows an attacker to not only gain information from coredumps, but also clobber existing, victim-writable files with coredumps. (This could theoretically lead to a privilege escalation.) Signed-off-by: Jann Horn <[email protected]> Cc: Kees Cook <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10kmod: handle UMH_WAIT_PROC from system unbound workqueueFrederic Weisbecker1-24/+20
The UMH_WAIT_PROC handler runs in its own thread in order to make sure that waiting for the exec kernel thread completion won't block other usermodehelper queued jobs. On older workqueue implementations, worklets couldn't sleep without blocking the rest of the queue. But now the workqueue subsystem handles that. Khelper still had the older limitation due to its singlethread properties but we replaced it to system unbound workqueues. Those are affine to the current node and can block up to some number of instances. They are a good candidate to handle UMH_WAIT_PROC assuming that we have enough system unbound workers to handle lots of parallel usermodehelper jobs. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Rik van Riel <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Rusty Russell <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10kmod: use system_unbound_wq instead of khelperFrederic Weisbecker3-26/+17
We need to launch the usermodehelper kernel threads with the widest affinity and this is partly why we use khelper. This workqueue has unbound properties and thus a wide affinity inherited by all its children. Now khelper also has special properties that we aren't much interested in: ordered and singlethread. There is really no need about ordering as all we do is creating kernel threads. This can be done concurrently. And singlethread is a useless limitation as well. The workqueue engine already proposes generic unbound workqueues that don't share these useless properties and handle well parallel jobs. The only worrysome specific is their affinity to the node of the current CPU. It's fine for creating the usermodehelper kernel threads but those inherit this affinity for longer jobs such as requesting modules. This patch proposes to use these node affine unbound workqueues assuming that a node is sufficient to handle several parallel usermodehelper requests. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Rik van Riel <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Rusty Russell <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10kmod: add up-to-date explanations on the purpose of each asynchronous levelsFrederic Weisbecker1-8/+24
There seem to be quite some confusions on the comments, likely due to changes that came after them. Now since it's very non obvious why we have 3 levels of asynchronous code to implement usermodehelpers, it's important to comment in detail the reason of this layout. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Rik van Riel <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Rusty Russell <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10kmod: remove unecessary explicit wide CPU affinity settingFrederic Weisbecker1-3/+0
Khelper is affine to all CPUs. Now since it creates the call_usermodehelper_exec_[a]sync() kernel threads, those inherit the wide affinity. As such explicitly forcing a wide affinity from those kernel threads is like a no-op. Just remove it. It's needless and it breaks CPU isolation users who rely on workqueue affinity tuning. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Rik van Riel <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Rusty Russell <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10kmod: bunch of internal functions renamesFrederic Weisbecker1-13/+17
This patchset does a bunch of cleanups and converts khelper to use system unbound workqueues. The 3 first patches should be uncontroversial. The last 2 patches are debatable. Kmod creates kernel threads that perform userspace jobs and we want those to have a large affinity in order not to contend busy CPUs. This is (partly) why we use khelper which has a wide affinity that the kernel threads it create can inherit from. Now khelper is a dedicated workqueue that has singlethread properties which we aren't interested in. Hence those two debatable changes: _ We would like to use generic workqueues. System unbound workqueues are a very good candidate but they are not wide affine, only node affine. Now probably a node is enough to perform many parallel kmod jobs. _ We would like to remove the wait_for_helper kernel thread (UMH_WAIT_PROC handler) to use the workqueue. It means that if the workqueue blocks, and no other worker can take pending kmod request, we can be screwed. Now if we have 512 threads, this should be enough. This patch (of 5): Underscores on function names aren't much verbose to explain the purpose of a function. And kmod has interesting such flavours. Lets rename the following functions: * __call_usermodehelper -> call_usermodehelper_exec_work * ____call_usermodehelper -> call_usermodehelper_exec_async * wait_for_helper -> call_usermodehelper_exec_sync Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Rik van Riel <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Rusty Russell <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10kmod: correct documentation of return status of request_moduleNeilBrown1-4/+5
If request_module() successfully runs modprobe, but modprobe exits with a non-zero status, then the return value from request_module() will be that (positive) error status. So the return from request_module can be: negative errno zero for success positive exit code. Signed-off-by: NeilBrown <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Rusty Russell <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10hfs: fix B-tree corruption after insertion at position 0Hin-Tak Leung1-9/+11
Fix B-tree corruption when a new record is inserted at position 0 in the node in hfs_brec_insert(). This is an identical change to the corresponding hfs b-tree code to Sergei Antonov's "hfsplus: fix B-tree corruption after insertion at position 0", to keep similar code paths in the hfs and hfsplus drivers in sync, where appropriate. Signed-off-by: Hin-Tak Leung <[email protected]> Cc: Sergei Antonov <[email protected]> Cc: Joe Perches <[email protected]> Reviewed-by: Vyacheslav Dubeyko <[email protected]> Cc: Anton Altaparmakov <[email protected]> Cc: Al Viro <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10hfs,hfsplus: cache pages correctly between bnode_create and bnode_freeHin-Tak Leung2-8/+4
Pages looked up by __hfs_bnode_create() (called by hfs_bnode_create() and hfs_bnode_find() for finding or creating pages corresponding to an inode) are immediately kmap()'ed and used (both read and write) and kunmap()'ed, and should not be page_cache_release()'ed until hfs_bnode_free(). This patch fixes a problem I first saw in July 2012: merely running "du" on a large hfsplus-mounted directory a few times on a reasonably loaded system would get the hfsplus driver all confused and complaining about B-tree inconsistencies, and generates a "BUG: Bad page state". Most recently, I can generate this problem on up-to-date Fedora 22 with shipped kernel 4.0.5, by running "du /" (="/" + "/home" + "/mnt" + other smaller mounts) and "du /mnt" simultaneously on two windows, where /mnt is a lightly-used QEMU VM image of the full Mac OS X 10.9: $ df -i / /home /mnt Filesystem Inodes IUsed IFree IUse% Mounted on /dev/mapper/fedora-root 3276800 551665 2725135 17% / /dev/mapper/fedora-home 52879360 716221 52163139 2% /home /dev/nbd0p2 4294967295 1387818 4293579477 1% /mnt After applying the patch, I was able to run "du /" (60+ times) and "du /mnt" (150+ times) continuously and simultaneously for 6+ hours. There are many reports of the hfsplus driver getting confused under load and generating "BUG: Bad page state" or other similar issues over the years. [1] The unpatched code [2] has always been wrong since it entered the kernel tree. The only reason why it gets away with it is that the kmap/memcpy/kunmap follow very quickly after the page_cache_release() so the kernel has not had a chance to reuse the memory for something else, most of the time. The current RW driver appears to have followed the design and development of the earlier read-only hfsplus driver [3], where-by version 0.1 (Dec 2001) had a B-tree node-centric approach to read_cache_page()/page_cache_release() per bnode_get()/bnode_put(), migrating towards version 0.2 (June 2002) of caching and releasing pages per inode extents. When the current RW code first entered the kernel [2] in 2005, there was an REF_PAGES conditional (and "//" commented out code) to switch between B-node centric paging to inode-centric paging. There was a mistake with the direction of one of the REF_PAGES conditionals in __hfs_bnode_create(). In a subsequent "remove debug code" commit [4], the read_cache_page()/page_cache_release() per bnode_get()/bnode_put() were removed, but a page_cache_release() was mistakenly left in (propagating the "REF_PAGES <-> !REF_PAGE" mistake), and the commented-out page_cache_release() in bnode_release() (which should be spanned by !REF_PAGES) was never enabled. References: [1]: Michael Fox, Apr 2013 http://www.spinics.net/lists/linux-fsdevel/msg63807.html ("hfsplus volume suddenly inaccessable after 'hfs: recoff %d too large'") Sasha Levin, Feb 2015 http://lkml.org/lkml/2015/2/20/85 ("use after free") https://bugs.launchpad.net/ubuntu/+source/linux/+bug/740814 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1027887 https://bugzilla.kernel.org/show_bug.cgi?id=42342 https://bugzilla.kernel.org/show_bug.cgi?id=63841 https://bugzilla.kernel.org/show_bug.cgi?id=78761 [2]: http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\ fs/hfs/bnode.c?id=d1081202f1d0ee35ab0beb490da4b65d4bc763db commit d1081202f1d0ee35ab0beb490da4b65d4bc763db Author: Andrew Morton <[email protected]> Date: Wed Feb 25 16:17:36 2004 -0800 [PATCH] HFS rewrite http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\ fs/hfsplus/bnode.c?id=91556682e0bf004d98a529bf829d339abb98bbbd commit 91556682e0bf004d98a529bf829d339abb98bbbd Author: Andrew Morton <[email protected]> Date: Wed Feb 25 16:17:48 2004 -0800 [PATCH] HFS+ support [3]: http://sourceforge.net/projects/linux-hfsplus/ http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.1/ http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.2/ http://linux-hfsplus.cvs.sourceforge.net/viewvc/linux-hfsplus/linux/\ fs/hfsplus/bnode.c?r1=1.4&r2=1.5 Date: Thu Jun 6 09:45:14 2002 +0000 Use buffer cache instead of page cache in bnode.c. Cache inode extents. [4]: http://git.kernel.org/cgit/linux/kernel/git/\ stable/linux-stable.git/commit/?id=a5e3985fa014029eb6795664c704953720cc7f7d commit a5e3985fa014029eb6795664c704953720cc7f7d Author: Roman Zippel <[email protected]> Date: Tue Sep 6 15:18:47 2005 -0700 [PATCH] hfs: remove debug code Signed-off-by: Hin-Tak Leung <[email protected]> Signed-off-by: Sergei Antonov <[email protected]> Reviewed-by: Anton Altaparmakov <[email protected]> Reported-by: Sasha Levin <[email protected]> Cc: Al Viro <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Vyacheslav Dubeyko <[email protected]> Cc: Sougata Santra <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10fs/coda: fix readlink buffer overflowJan Harkes1-3/+3
Dan Carpenter discovered a buffer overflow in the Coda file system readlink code. A userspace file system daemon can return a 4096 byte result which then triggers a one byte write past the allocated readlink result buffer. This does not trigger with an unmodified Coda implementation because Coda has a 1024 byte limit for symbolic links, however other userspace file systems using the Coda kernel module could be affected. Although this is an obvious overflow, I don't think this has to be handled as too sensitive from a security perspective because the overflow is on the Coda userspace daemon side which already needs root to open Coda's kernel device and to mount the file system before we get to the point that links can be read. [[email protected]: coding-style fixes] Signed-off-by: Jan Harkes <[email protected]> Reported-by: Dan Carpenter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10checkpatch: add constant comparison on left side testJoe Perches1-0/+29
"CONST <comparison> variable" checks like: if (NULL != foo) and while (0 < bar(...)) where a constant (or what appears to be a constant like an upper case identifier) is on the left of a comparison are generally preferred to be written using the constant on the right side like: if (foo != NULL) and while (bar(...) > 0) Add a test for this. Add a --fix option too, but only do it when the code is immediately surrounded by parentheses to avoid misfixing things like "(0 < bar() + constant)" Signed-off-by: Joe Perches <[email protected]> Cc: Nicolas Morey Chaisemartin <[email protected]> Cc: Viresh Kumar <[email protected]> Cc: Dan Carpenter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10checkpatch: add __pmem to $Sparse annotationsJoe Perches1-0/+1
commit 61031952f4c8 ("arch, x86: pmem api for ensuring durability of persistent memory updates") added a new __pmem annotation for sparse verification. Add __pmem to the $Sparse variable so checkpatch can appropriately ignore uses of this attribute too. Signed-off-by: Joe Perches <[email protected]> Reviewed-by: Ross Zwisler <[email protected]> Acked-by: Andy Whitcroft <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10checkpatch: fix left brace warningEddie Kovsky1-4/+4
Using checkpatch.pl with Perl 5.22.0 generates the following warning: Unescaped left brace in regex is deprecated, passed through in regex; This patch fixes the warnings by escaping occurrences of the left brace inside the regular expression. Signed-off-by: Eddie Kovsky <[email protected]> Cc: Joe Perches <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10checkpatch: avoid some commit message long line warningsJoe Perches1-3/+27
Fixes: and Link: lines may exceed 75 chars in the commit log. So too can stack dump and dmesg lines and lines that seem like filenames. And Fixes: lines don't need to have a "commit" prefix before the commit id. Add exceptions for these types of lines. Signed-off-by: Joe Perches <[email protected]> Reported-by: Paul Bolle <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10checkpatch: emit an error on formats with 0x%<decimal>Joe Perches1-2/+6
Using 0x%d is wrong. Emit a message when it happens. Miscellanea: Improve the %Lu warning to match formats like %16Lu. Signed-off-by: Joe Perches <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10checkpatch: make --strict the default for drivers/staging files and patchesJoe Perches1-1/+1
Making --strict the default for staging may help some people submit patches without obvious defects. Signed-off-by: Joe Perches <[email protected]> Cc: Dan Carpenter <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10checkpatch: always check block comment stylesJoe Perches1-8/+11
Some of the block comment tests that are used only for networking are appropriate for all patches. For example, these styles are not encouraged: /* block comment without introductory * */ and /* * block comment with line terminating */ Remove the networking specific test and add comments. There are some infrequent false positives where code is lazily commented out using /* and */ rather than using #if 0/#endif blocks like: /* case foo: case bar: */ case baz: Signed-off-by: Joe Perches <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10checkpatch: report the right line # when using --emacs and --fileJoe Perches1-1/+5
commit 34d8815f9512 ("checkpatch: add --showfile to allow input via pipe to show filenames") broke the --emacs with --file option. Fix it. Signed-off-by: Joe Perches <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10checkpatch: add some <foo>_destroy functions to NEEDLESS_IF testsJoe Perches1-4/+28
Sergey Senozhatsky has modified several destroy functions that can now be called with NULL values. - kmem_cache_destroy() - mempool_destroy() - dma_pool_destroy() Update checkpatch to warn when those functions are preceded by an if. Update checkpatch to --fix all the calls too only when the code style form is using leading tabs. from: if (foo) <func>(foo); to: <func>(foo); Signed-off-by: Joe Perches <[email protected]> Tested-by: Sergey Senozhatsky <[email protected]> Cc: David Rientjes <[email protected]> Cc: Julia Lawall <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10checkpatch: Allow longer declaration macrosJoe Perches1-1/+1
Some really long declaration macros exist. For instance; DEFINE_DMA_BUF_EXPORT_INFO(exp_info); and DECLARE_DM_KCOPYD_THROTTLE_WITH_MODULE_PARM(name, description) Increase the limit from 2 words to 6 after DECLARE/DEFINE uses. Signed-off-by: Joe Perches <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10checkpatch: improve SUSPECT_CODE_INDENT testJoe Perches1-6/+15
Many lines exist like if (foo) bar; where the tabbed indentation of the branch is not one more than the "if" line above it. checkpatch should emit a warning on those lines. Miscellenea: o Remove comments from branch blocks o Skip blank lines in block Signed-off-by: Joe Perches <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10checkpatch: add warning on BUG/BUG_ON useJoe Perches1-6/+8
Using BUG/BUG_ON crashes the kernel and is just unfriendly. Enable code that emits a warning on BUG/BUG_ON use. Make the code emit the message at WARNING level when scanning a patch and at CHECK level when scanning files so that script users don't feel an obligation to fix code that might be above their pay grade. Signed-off-by: Joe Perches <[email protected]> Reported-by: Geert Uytterhoeven <[email protected]> Tested-by: Geert Uytterhoeven <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10checkpatch: warn on bare SHA-1 commit IDs in commit logsJoe Perches1-3/+12
Commit IDs should have commit descriptions too. Warn when a 12 to 40 byte SHA-1 is used in commit logs. Signed-off-by: Joe Perches <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10lib/test_kasan.c: make kmalloc_oob_krealloc_less more correctlyWang Long1-1/+1
In kmalloc_oob_krealloc_less, I think it is better to test the size2 boundary. If we do not call krealloc, the access of position size1 will still cause out-of-bounds and access of position size2 does not. After call krealloc, the access of position size2 cause out-of-bounds. So using size2 is more correct. Signed-off-by: Wang Long <[email protected]> Cc: Andrey Ryabinin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10lib/test_kasan.c: fix a typoWang Long1-2/+2
Signed-off-by: Wang Long <[email protected]> Cc: Andrey Ryabinin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10lib/string_helpers: rename "esc" arg to "only"Kees Cook2-14/+14
To further clarify the purpose of the "esc" argument, rename it to "only" to reflect that it is a limit, not a list of additional characters to escape. Signed-off-by: Kees Cook <[email protected]> Suggested-by: Rasmus Villemoes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10lib/string_helpers: clarify esc arg in string_escape_memKees Cook1-4/+6
The esc argument is used to reduce which characters will be escaped. For example, using " " with ESCAPE_SPACE will not produce any escaped spaces. Signed-off-by: Kees Cook <[email protected]> Cc: Andy Shevchenko <[email protected]> Cc: Rasmus Villemoes <[email protected]> Cc: Mathias Krause <[email protected]> Cc: James Bottomley <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10hexdump: do not print debug dumps for !CONFIG_DEBUGLinus Walleij1-2/+8
print_hex_dump_debug() is likely supposed to be analogous to pr_debug() or dev_dbg() & friends. Currently it will adhere to dynamic debug, but will not stub out prints if CONFIG_DEBUG is not set. Let's make it do the right thing, because I am tired of having my dmesg buffer full of hex dumps on production systems. Signed-off-by: Linus Walleij <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10lib/bitmap.c: bitmap_parselist can accept string with whitespaces on head or ↵Pan Xinhui1-14/+18
tail In __bitmap_parselist we can accept whitespaces on head or tail during every parsing procedure. If input has valid ranges, there is no reason to reject the user. For example, bitmap_parselist(" 1-3, 5, ", &mask, nmaskbits). After separating the string, we get " 1-3", " 5", and " ". It's possible and reasonable to accept such string as long as the parsing result is correct. Signed-off-by: Pan Xinhui <[email protected]> Cc: Yury Norov <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Rasmus Villemoes <[email protected]> Cc: Sudeep Holla <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10lib/bitmap.c: fix a special string handling bug in __bitmap_parselistPan Xinhui1-0/+4
If string end with '-', for exapmle, bitmap_parselist("1,0-",&mask, nmaskbits), It is not in a valid pattern, so add a check after loop. Return -EINVAL on such condition. Signed-off-by: Pan Xinhui <[email protected]> Cc: Yury Norov <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Rasmus Villemoes <[email protected]> Cc: Sudeep Holla <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10lib/bitmap.c: correct a code style and do some, optimizationPan Xinhui1-3/+4
We can avoid in-loop incrementation of ndigits. Save current totaldigits to ndigits before loop, and check ndigits against totaldigits after the loop. Signed-off-by: Pan Xinhui <[email protected]> Cc: Yury Norov <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Rasmus Villemoes <[email protected]> Cc: Sudeep Holla <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10proc: convert to kstrto*()/kstrto*_from_user()Alexey Dobriyan1-49/+21
Convert from manual allocation/copy_from_user/... to kstrto*() family which were designed for exactly that. One case can not be converted to kstrto*_from_user() to make code even more simpler because of whitespace stripping, oh well... Signed-off-by: Alexey Dobriyan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10kstrto*: accept "-0" for signed conversionAlexey Dobriyan2-6/+2
strtol(3) et al accept "-0", so should we. Signed-off-by: Alexey Dobriyan <[email protected]> Cc: David Howells <[email protected]> Cc: Jan Kara <[email protected]> Cc: Joel Becker <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Rasmus Villemoes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10MAINTAINERS/CREDITS: mark MaxRAID as Orphan, move Anil Ravindranath to CREDITSJoe Perches2-2/+5
Anil's email address bounces and he hasn't had a signoff in over 5 years. Signed-off-by: Joe Perches <[email protected]> Cc: James Bottomley <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10include/linux/printk.h: include pr_fmt in pr_debug_ratelimitedJason A. Donenfeld1-2/+2
The other two implementations of pr_debug_ratelimited include pr_fmt, along with every other pr_* function. But pr_debug_ratelimited forgot to add it with the CONFIG_DYNAMIC_DEBUG implementation. This patch unifies the behavior. Signed-off-by: Jason A. Donenfeld <[email protected]> Cc: Steven Rostedt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10kernel/cred.c: remove unnecessary kdebug atomic readsJoe Perches1-4/+9
Commit e0e817392b9a ("CRED: Add some configurable debugging [try #6]") added the kdebug mechanism to this file back in 2009. The kdebug macro calls no_printk which always evaluates arguments. Most of the kdebug uses have an unnecessary call of atomic_read(&cred->usage) Make the kdebug macro do nothing by defining it with do { if (0) no_printk(...); } while (0) when not enabled. $ size kernel/cred.o* (defconfig x86-64) text data bss dec hex filename 2748 336 8 3092 c14 kernel/cred.o.new 2788 336 8 3132 c3c kernel/cred.o.old Miscellanea: o Neaten the #define kdebug macros while there Signed-off-by: Joe Perches <[email protected]> Cc: David Howells <[email protected]> Cc: James Morris <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10kernel/extable.c: remove duplicated includeWei Yongjun1-1/+0
Signed-off-by: Wei Yongjun <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10include/linux/poison.h: remove not-used poison pointer macrosVasily Kulikov1-7/+0
Signed-off-by: Vasily Kulikov <[email protected]> Cc: Solar Designer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10include/linux/poison.h: fix LIST_POISON{1,2} offsetVasily Kulikov1-2/+2
Poison pointer values should be small enough to find a room in non-mmap'able/hardly-mmap'able space. E.g. on x86 "poison pointer space" is located starting from 0x0. Given unprivileged users cannot mmap anything below mmap_min_addr, it should be safe to use poison pointers lower than mmap_min_addr. The current poison pointer values of LIST_POISON{1,2} might be too big for mmap_min_addr values equal or less than 1 MB (common case, e.g. Ubuntu uses only 0x10000). There is little point to use such a big value given the "poison pointer space" below 1 MB is not yet exhausted. Changing it to a smaller value solves the problem for small mmap_min_addr setups. The values are suggested by Solar Designer: http://www.openwall.com/lists/oss-security/2015/05/02/6 Signed-off-by: Vasily Kulikov <[email protected]> Cc: Solar Designer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10proc: change proc_subdir_lock to a rwlockWaiman Long1-22/+22
The proc_subdir_lock spinlock is used to allow only one task to make change to the proc directory structure as well as looking up information in it. However, the information lookup part can actually be entered by more than one task as the pde_get() and pde_put() reference count update calls in the critical sections are atomic increment and decrement respectively and so are safe with concurrent updates. The x86 architecture has already used qrwlock which is fair and other architectures like ARM are in the process of switching to qrwlock. So unfairness shouldn't be a concern in that conversion. This patch changed the proc_subdir_lock to a rwlock in order to enable concurrent lookup. The following functions were modified to take a write lock: - proc_register() - remove_proc_entry() - remove_proc_subtree() The following functions were modified to take a read lock: - xlate_proc_name() - proc_lookup_de() - proc_readdir_de() A parallel /proc filesystem search with the "find" command (1000 threads) was run on a 4-socket Haswell-EX box (144 threads). Before the patch, the parallel search took about 39s. After the patch, the parallel find took only 25s, a saving of about 14s. The micro-benchmark that I used was artificial, but it was used to reproduce an exit hanging problem that I saw in real application. In fact, only allow one task to do a lookup seems too limiting to me. Signed-off-by: Waiman Long <[email protected]> Acked-by: "Eric W. Biederman" <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Nicolas Dichtel <[email protected]> Cc: Al Viro <[email protected]> Cc: Scott J Norton <[email protected]> Cc: Douglas Hatch <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10procfs: always expose /proc/<pid>/map_files/ and make it readableCalvin Owens1-19/+24
Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and is only exposed if CONFIG_CHECKPOINT_RESTORE is set. Each mapped file region gets a symlink in /proc/<pid>/map_files/ corresponding to the virtual address range at which it is mapped. The symlinks work like the symlinks in /proc/<pid>/fd/, so you can follow them to the backing file even if that backing file has been unlinked. Currently, files which are mapped, unlinked, and closed are impossible to stat() from userspace. Exposing /proc/<pid>/map_files/ closes this functionality "hole". Not being able to stat() such files makes noticing and explicitly accounting for the space they use on the filesystem impossible. You can work around this by summing up the space used by every file in the filesystem and subtracting that total from what statfs() tells you, but that obviously isn't great, and it becomes unworkable once your filesystem becomes large enough. This patch moves map_files/ out from behind CONFIG_CHECKPOINT_RESTORE, and adjusts the permissions enforced on it as follows: * proc_map_files_lookup() * proc_map_files_readdir() * map_files_d_revalidate() Remove the CAP_SYS_ADMIN restriction, leaving only the current restriction requiring PTRACE_MODE_READ. The information made available to userspace by these three functions is already available in /proc/PID/maps with MODE_READ, so I don't see any reason to limit them any further (see below for more detail). * proc_map_files_follow_link() This stub has been added, and requires that the user have CAP_SYS_ADMIN in order to follow the links in map_files/, since there was concern on LKML both about the potential for bypassing permissions on ancestor directories in the path to files pointed to, and about what happens with more exotic memory mappings created by some drivers (ie dma-buf). In older versions of this patch, I changed every permission check in the four functions above to enforce MODE_ATTACH instead of MODE_READ. This was an oversight on my part, and after revisiting the discussion it seems that nobody was concerned about anything outside of what is made possible by ->follow_link(). So in this version, I've left the checks for PTRACE_MODE_READ as-is. [[email protected]: catch up with concurrent proc_pid_follow_link() changes] Signed-off-by: Calvin Owens <[email protected]> Reviewed-by: Kees Cook <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Joe Perches <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10proc: add cond_resched to /proc/kpage* read/write loopVladimir Davydov1-0/+6
Reading/writing a /proc/kpage* file may take long on machines with a lot of RAM installed. Signed-off-by: Vladimir Davydov <[email protected]> Suggested-by: Andres Lagar-Cavilla <[email protected]> Reviewed-by: Andres Lagar-Cavilla <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Raghavendra K T <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: David Rientjes <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10proc: export idle flag via kpageflagsVladimir Davydov3-0/+11
As noted by Minchan, a benefit of reading idle flag from /proc/kpageflags is that one can easily filter dirty and/or unevictable pages while estimating the size of unused memory. Note that idle flag read from /proc/kpageflags may be stale in case the page was accessed via a PTE, because it would be too costly to iterate over all page mappings on each /proc/kpageflags read to provide an up-to-date value. To make sure the flag is up-to-date one has to read /sys/kernel/mm/page_idle/bitmap first. Signed-off-by: Vladimir Davydov <[email protected]> Reviewed-by: Andres Lagar-Cavilla <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Raghavendra K T <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: David Rientjes <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10mm: introduce idle page trackingVladimir Davydov17-3/+512
Knowing the portion of memory that is not used by a certain application or memory cgroup (idle memory) can be useful for partitioning the system efficiently, e.g. by setting memory cgroup limits appropriately. Currently, the only means to estimate the amount of idle memory provided by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the access bit for all pages mapped to a particular process by writing 1 to clear_refs, wait for some time, and then count smaps:Referenced. However, this method has two serious shortcomings: - it does not count unmapped file pages - it affects the reclaimer logic To overcome these drawbacks, this patch introduces two new page flags, Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap. A page's Idle flag can only be set from userspace by setting bit in /sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page, and it is cleared whenever the page is accessed either through page tables (it is cleared in page_referenced() in this case) or using the read(2) system call (mark_page_accessed()). Thus by setting the Idle flag for pages of a particular workload, which can be found e.g. by reading /proc/PID/pagemap, waiting for some time to let the workload access its working set, and then reading the bitmap file, one can estimate the amount of pages that are not used by the workload. The Young page flag is used to avoid interference with the memory reclaimer. A page's Young flag is set whenever the Access bit of a page table entry pointing to the page is cleared by writing to the bitmap file. If page_referenced() is called on a Young page, it will add 1 to its return value, therefore concealing the fact that the Access bit was cleared. Note, since there is no room for extra page flags on 32 bit, this feature uses extended page flags when compiled on 32 bit. [[email protected]: fix build] [[email protected]: kpageidle requires an MMU] [[email protected]: decouple from page-flags rework] Signed-off-by: Vladimir Davydov <[email protected]> Reviewed-by: Andres Lagar-Cavilla <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Raghavendra K T <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: David Rientjes <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10mmu-notifier: add clear_young callbackVladimir Davydov3-0/+92
In the scope of the idle memory tracking feature, which is introduced by the following patch, we need to clear the referenced/accessed bit not only in primary, but also in secondary ptes. The latter is required in order to estimate wss of KVM VMs. At the same time we want to avoid flushing tlb, because it is quite expensive and it won't really affect the final result. Currently, there is no function for clearing pte young bit that would meet our requirements, so this patch introduces one. To achieve that we have to add a new mmu-notifier callback, clear_young, since there is no method for testing-and-clearing a secondary pte w/o flushing tlb. The new method is not mandatory and currently only implemented by KVM. Signed-off-by: Vladimir Davydov <[email protected]> Reviewed-by: Andres Lagar-Cavilla <[email protected]> Acked-by: Paolo Bonzini <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Raghavendra K T <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: David Rientjes <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10proc: add kpagecgroup fileVladimir Davydov2-1/+58
/proc/kpagecgroup contains a 64-bit inode number of the memory cgroup each page is charged to, indexed by PFN. Having this information is useful for estimating a cgroup working set size. The file is present if CONFIG_PROC_PAGE_MONITOR && CONFIG_MEMCG. Signed-off-by: Vladimir Davydov <[email protected]> Reviewed-by: Andres Lagar-Cavilla <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Raghavendra K T <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: David Rientjes <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10memcg: zap try_get_mem_cgroup_from_pageVladimir Davydov2-44/+13
It is only used in mem_cgroup_try_charge, so fold it in and zap it. Signed-off-by: Vladimir Davydov <[email protected]> Reviewed-by: Andres Lagar-Cavilla <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Raghavendra K T <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: David Rientjes <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10hwpoison: use page_cgroup_ino for filtering by memcgVladimir Davydov2-18/+3
Hwpoison allows to filter pages by memory cgroup ino. Currently, it calls try_get_mem_cgroup_from_page to obtain the cgroup from a page and then its ino using cgroup_ino, but now we have a helper method for that, page_cgroup_ino, so use it instead. This patch also loosens the hwpoison memcg filter dependency rules - it makes it depend on CONFIG_MEMCG instead of CONFIG_MEMCG_SWAP, because hwpoison memcg filter does not require anything (nor it used to) from CONFIG_MEMCG_SWAP side. Signed-off-by: Vladimir Davydov <[email protected]> Reviewed-by: Andres Lagar-Cavilla <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Raghavendra K T <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: David Rientjes <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-10memcg: add page_cgroup_ino helperVladimir Davydov2-0/+29
This patchset introduces a new user API for tracking user memory pages that have not been used for a given period of time. The purpose of this is to provide the userspace with the means of tracking a workload's working set, i.e. the set of pages that are actively used by the workload. Knowing the working set size can be useful for partitioning the system more efficiently, e.g. by tuning memory cgroup limits appropriately, or for job placement within a compute cluster. ==== USE CASES ==== The unified cgroup hierarchy has memory.low and memory.high knobs, which are defined as the low and high boundaries for the workload working set size. However, the working set size of a workload may be unknown or change in time. With this patch set, one can periodically estimate the amount of memory unused by each cgroup and tune their memory.low and memory.high parameters accordingly, therefore optimizing the overall memory utilization. Another use case is balancing workloads within a compute cluster. Knowing how much memory is not really used by a workload unit may help take a more optimal decision when considering migrating the unit to another node within the cluster. Also, as noted by Minchan, this would be useful for per-process reclaim (https://lwn.net/Articles/545668/). With idle tracking, we could reclaim idle pages only by smart user memory manager. ==== USER API ==== The user API consists of two new files: * /sys/kernel/mm/page_idle/bitmap. This file implements a bitmap where each bit corresponds to a page, indexed by PFN. When the bit is set, the corresponding page is idle. A page is considered idle if it has not been accessed since it was marked idle. To mark a page idle one should set the bit corresponding to the page by writing to the file. A value written to the file is OR-ed with the current bitmap value. Only user memory pages can be marked idle, for other page types input is silently ignored. Writing to this file beyond max PFN results in the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is set. This file can be used to estimate the amount of pages that are not used by a particular workload as follows: 1. mark all pages of interest idle by setting corresponding bits in the /sys/kernel/mm/page_idle/bitmap 2. wait until the workload accesses its working set 3. read /sys/kernel/mm/page_idle/bitmap and count the number of bits set * /proc/kpagecgroup. This file contains a 64-bit inode number of the memory cgroup each page is charged to, indexed by PFN. Only available when CONFIG_MEMCG is set. This file can be used to find all pages (including unmapped file pages) accounted to a particular cgroup. Using /sys/kernel/mm/page_idle/bitmap, one can then estimate the cgroup working set size. For an example of using these files for estimating the amount of unused memory pages per each memory cgroup, please see the script attached below. ==== REASONING ==== The reason to introduce the new user API instead of using /proc/PID/{clear_refs,smaps} is that the latter has two serious drawbacks: - it does not count unmapped file pages - it affects the reclaimer logic The new API attempts to overcome them both. For more details on how it is achieved, please see the comment to patch 6. ==== PATCHSET STRUCTURE ==== The patch set is organized as follows: - patch 1 adds page_cgroup_ino() helper for the sake of /proc/kpagecgroup and patches 2-3 do related cleanup - patch 4 adds /proc/kpagecgroup, which reports cgroup ino each page is charged to - patch 5 introduces a new mmu notifier callback, clear_young, which is a lightweight version of clear_flush_young; it is used in patch 6 - patch 6 implements the idle page tracking feature, including the userspace API, /sys/kernel/mm/page_idle/bitmap - patch 7 exports idle flag via /proc/kpageflags ==== SIMILAR WORKS ==== Originally, the patch for tracking idle memory was proposed back in 2011 by Michel Lespinasse (see http://lwn.net/Articles/459269/). The main difference between Michel's patch and this one is that Michel implemented a kernel space daemon for estimating idle memory size per cgroup while this patch only provides the userspace with the minimal API for doing the job, leaving the rest up to the userspace. However, they both share the same idea of Idle/Young page flags to avoid affecting the reclaimer logic. ==== PERFORMANCE EVALUATION ==== SPECjvm2008 (https://www.spec.org/jvm2008/) was used to evaluate the performance impact introduced by this patch set. Three runs were carried out: - base: kernel without the patch - patched: patched kernel, the feature is not used - patched-active: patched kernel, 1 minute-period daemon is used for tracking idle memory For tracking idle memory, idlememstat utility was used: https://github.com/locker/idlememstat testcase base patched patched-active compiler 537.40 ( 0.00)% 532.26 (-0.96)% 538.31 ( 0.17)% compress 305.47 ( 0.00)% 301.08 (-1.44)% 300.71 (-1.56)% crypto 284.32 ( 0.00)% 282.21 (-0.74)% 284.87 ( 0.19)% derby 411.05 ( 0.00)% 413.44 ( 0.58)% 412.07 ( 0.25)% mpegaudio 189.96 ( 0.00)% 190.87 ( 0.48)% 189.42 (-0.28)% scimark.large 46.85 ( 0.00)% 46.41 (-0.94)% 47.83 ( 2.09)% scimark.small 412.91 ( 0.00)% 415.41 ( 0.61)% 421.17 ( 2.00)% serial 204.23 ( 0.00)% 213.46 ( 4.52)% 203.17 (-0.52)% startup 36.76 ( 0.00)% 35.49 (-3.45)% 35.64 (-3.05)% sunflow 115.34 ( 0.00)% 115.08 (-0.23)% 117.37 ( 1.76)% xml 620.55 ( 0.00)% 619.95 (-0.10)% 620.39 (-0.03)% composite 211.50 ( 0.00)% 211.15 (-0.17)% 211.67 ( 0.08)% time idlememstat: 17.20user 65.16system 2:15:23elapsed 1%CPU (0avgtext+0avgdata 8476maxresident)k 448inputs+40outputs (1major+36052minor)pagefaults 0swaps ==== SCRIPT FOR COUNTING IDLE PAGES PER CGROUP ==== #! /usr/bin/python # import os import stat import errno import struct CGROUP_MOUNT = "/sys/fs/cgroup/memory" BUFSIZE = 8 * 1024 # must be multiple of 8 def get_hugepage_size(): with open("/proc/meminfo", "r") as f: for s in f: k, v = s.split(":") if k == "Hugepagesize": return int(v.split()[0]) * 1024 PAGE_SIZE = os.sysconf("SC_PAGE_SIZE") HUGEPAGE_SIZE = get_hugepage_size() def set_idle(): f = open("/sys/kernel/mm/page_idle/bitmap", "wb", BUFSIZE) while True: try: f.write(struct.pack("Q", pow(2, 64) - 1)) except IOError as err: if err.errno == errno.ENXIO: break raise f.close() def count_idle(): f_flags = open("/proc/kpageflags", "rb", BUFSIZE) f_cgroup = open("/proc/kpagecgroup", "rb", BUFSIZE) with open("/sys/kernel/mm/page_idle/bitmap", "rb", BUFSIZE) as f: while f.read(BUFSIZE): pass # update idle flag idlememsz = {} while True: s1, s2 = f_flags.read(8), f_cgroup.read(8) if not s1 or not s2: break flags, = struct.unpack('Q', s1) cgino, = struct.unpack('Q', s2) unevictable = (flags >> 18) & 1 huge = (flags >> 22) & 1 idle = (flags >> 25) & 1 if idle and not unevictable: idlememsz[cgino] = idlememsz.get(cgino, 0) + \ (HUGEPAGE_SIZE if huge else PAGE_SIZE) f_flags.close() f_cgroup.close() return idlememsz if __name__ == "__main__": print "Setting the idle flag for each page..." set_idle() raw_input("Wait until the workload accesses its working set, " "then press Enter") print "Counting idle pages..." idlememsz = count_idle() for dir, subdirs, files in os.walk(CGROUP_MOUNT): ino = os.stat(dir)[stat.ST_INO] print dir + ": " + str(idlememsz.get(ino, 0) / 1024) + " kB" ==== END SCRIPT ==== This patch (of 8): Add page_cgroup_ino() helper to memcg. This function returns the inode number of the closest online ancestor of the memory cgroup a page is charged to. It is required for exporting information about which page is charged to which cgroup to userspace, which will be introduced by a following patch. Signed-off-by: Vladimir Davydov <[email protected]> Reviewed-by: Andres Lagar-Cavilla <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Raghavendra K T <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: David Rientjes <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>