path: root/mm
Age | Commit message | Author | Files | Lines
2012-01-03 | should_remove_suid(): inode->i_mode is umode_t | Al Viro | 1 | -1/+1
Signed-off-by: Al Viro <[email protected]>
2012-01-03 | shmem, ramfs: propagate umode_t, open-coded S_ISREG | Al Viro | 1 | -3/+3
Signed-off-by: Al Viro <[email protected]>
2012-01-03 | switch debugfs to umode_t | Al Viro | 2 | -2/+2
Signed-off-by: Al Viro <[email protected]>
2012-01-03 | switch ->mknod() to umode_t | Al Viro | 1 | -1/+1
Signed-off-by: Al Viro <[email protected]>
2012-01-03 | switch ->create() to umode_t | Al Viro | 1 | -1/+1
vfs_create() ignores everything outside of 16bit subset of its mode argument; switching it to umode_t is obviously equivalent and it's the only caller of the method Signed-off-by: Al Viro <[email protected]>
2012-01-03 | switch vfs_mkdir() and ->mkdir() to umode_t | Al Viro | 1 | -1/+1
vfs_mkdir() gets int, but immediately drops everything that might not fit into umode_t and that's the only caller of ->mkdir()... Signed-off-by: Al Viro <[email protected]>
2012-01-03 | fs: move code out of buffer.c | Al Viro | 2 | -2/+1
Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export kill_bdev as well, so brd doesn't have to open code it. Reduce buffer_head.h requirement accordingly. Removed a rather large comment from invalidate_bdev, as it looked a bit obsolete to bother moving. The small comment replacing it says enough. Signed-off-by: Nick Piggin <[email protected]> Cc: Al Viro <[email protected]> Cc: Christoph Hellwig <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Al Viro <[email protected]>
2012-01-03 | vfs: fix the stupidity with i_dentry in inode destructors | Al Viro | 1 | -1/+0
Seeing that just about every destructor got that INIT_LIST_HEAD() copied into it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once(); the cost of taking it into inode_init_always() will be negligible for pipes and sockets and negative for everything else. Not to mention the removal of boilerplate code from ->destroy_inode() instances... Signed-off-by: Al Viro <[email protected]>
2011-12-30 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net | David S. Miller | 2 | -2/+11
2011-12-29 | mm: hugetlb: fix non-atomic enqueue of huge page | Hillf Danton | 1 | -1/+1
If a huge page is enqueued under the protection of hugetlb_lock, then the operation is atomic and safe. Signed-off-by: Hillf Danton <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Cc: <[email protected]> [2.6.37+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-12-29 | mm/mempolicy.c: refix mbind_range() vma issue | KOSAKI Motohiro | 1 | -1/+10
commit 8aacc9f550 ("mm/mempolicy.c: fix pgoff in mbind vma merge") is a slightly incorrect fix.

Why? Consider the following case.

 1. map 4 pages of a file at offset 0

    [0123]

 2. map 2 pages just after the first mapping of the same file but with page offset 2

    [0123][23]

 3. mbind() 2 pages from the first mapping at offset 2.
    mbind_range() should treat the new vma as

    [0123][23]
      |23|
      mbind vma

    but it does

    [0123][23]
      |01|
      mbind vma

    Oops. Then it makes a wrong vma merge and split ([01][0123] or similar).

This patch fixes it.

[testcase]
 test result - before the patch

	case4: 126: test failed. expect '2,4', actual '2,2,2'
	case5: passed
	case6: passed
	case7: passed
	case8: passed
	case_n: 246: test failed. expect '4,2', actual '1,4'

	------------[ cut here ]------------
	kernel BUG at mm/filemap.c:135!
	invalid opcode: 0000 [#4] SMP DEBUG_PAGEALLOC

	(snip long bug on messages)

 test result - after the patch

	case4: passed
	case5: passed
	case6: passed
	case7: passed
	case8: passed
	case_n: passed

 source: mbind_vma_test.c
============================================================
#include <numaif.h>
#include <numa.h>
#include <sys/mman.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>

static unsigned long pagesize;
void* mmap_addr;
struct bitmask *nmask;
char buf[1024];
FILE *file;
char retbuf[10240] = "";
int mapped_fd;

char *rubysrc = "ruby -e '\
  pid = %d; \
  vstart = 0x%llx; \
  vend = 0x%llx; \
  s = `pmap -q #{pid}`; \
  rary = []; \
  s.each_line {|line|; \
    ary=line.split(\" \"); \
    addr = ary[0].to_i(16); \
    if(vstart <= addr && addr < vend) then \
      rary.push(ary[1].to_i()/4); \
    end; \
  }; \
  print rary.join(\",\"); \
'";

void init(void)
{
	void* addr;
	char buf[128];

	nmask = numa_allocate_nodemask();
	numa_bitmask_setbit(nmask, 0);

	pagesize = getpagesize();

	sprintf(buf, "%s", "mbind_vma_XXXXXX");
	mapped_fd = mkstemp(buf);
	if (mapped_fd == -1)
		perror("mkstemp "), exit(1);
	unlink(buf);

	if (lseek(mapped_fd, pagesize*8, SEEK_SET) < 0)
		perror("lseek "), exit(1);
	if (write(mapped_fd, "\0", 1) < 0)
		perror("write "), exit(1);

	addr = mmap(NULL, pagesize*8, PROT_NONE, MAP_SHARED, mapped_fd, 0);
	if (addr == MAP_FAILED)
		perror("mmap "), exit(1);

	if (mprotect(addr+pagesize, pagesize*6, PROT_READ|PROT_WRITE) < 0)
		perror("mprotect "), exit(1);

	mmap_addr = addr + pagesize;

	/* make page populate */
	memset(mmap_addr, 0, pagesize*6);
}

void fin(void)
{
	void* addr = mmap_addr - pagesize;
	munmap(addr, pagesize*8);

	memset(buf, 0, sizeof(buf));
	memset(retbuf, 0, sizeof(retbuf));
}

void mem_bind(int index, int len)
{
	int err;

	err = mbind(mmap_addr+pagesize*index, pagesize*len,
		    MPOL_BIND, nmask->maskp, nmask->size, 0);
	if (err)
		perror("mbind "), exit(err);
}

void mem_interleave(int index, int len)
{
	int err;

	err = mbind(mmap_addr+pagesize*index, pagesize*len,
		    MPOL_INTERLEAVE, nmask->maskp, nmask->size, 0);
	if (err)
		perror("mbind "), exit(err);
}

void mem_unbind(int index, int len)
{
	int err;

	err = mbind(mmap_addr+pagesize*index, pagesize*len,
		    MPOL_DEFAULT, NULL, 0, 0);
	if (err)
		perror("mbind "), exit(err);
}

void Assert(char *expected, char *value, char *name, int line)
{
	if (strcmp(expected, value) == 0) {
		fprintf(stderr, "%s: passed\n", name);
		return;
	}
	else {
		fprintf(stderr, "%s: %d: test failed. expect '%s', actual '%s'\n",
			name, line,
			expected, value);
//		exit(1);
	}
}

/*
      AAAA
    PPPPPPNNNNNN
    might become
    PPNNNNNNNNNN
    case 4 below
*/
void case4(void)
{
	init();
	sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);

	mem_bind(0, 4);
	mem_unbind(2, 2);

	file = popen(buf, "r");
	fread(retbuf, sizeof(retbuf), 1, file);
	Assert("2,4", retbuf, "case4", __LINE__);

	fin();
}

/*
       AAAA
 PPPPPPNNNNNN
 might become
 PPPPPPPPPPNN
 case 5 below
*/
void case5(void)
{
	init();
	sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);

	mem_bind(0, 2);
	mem_bind(2, 2);

	file = popen(buf, "r");
	fread(retbuf, sizeof(retbuf), 1, file);
	Assert("4,2", retbuf, "case5", __LINE__);

	fin();
}

/*
	    AAAA
	PPPPNNNNXXXX
	might become
	PPPPPPPPPPPP 6
*/
void case6(void)
{
	init();
	sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);

	mem_bind(0, 2);
	mem_bind(4, 2);
	mem_bind(2, 2);

	file = popen(buf, "r");
	fread(retbuf, sizeof(retbuf), 1, file);
	Assert("6", retbuf, "case6", __LINE__);

	fin();
}

/*
    AAAA
PPPPNNNNXXXX
might become
PPPPPPPPXXXX 7
*/
void case7(void)
{
	init();
	sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);

	mem_bind(0, 2);
	mem_interleave(4, 2);
	mem_bind(2, 2);

	file = popen(buf, "r");
	fread(retbuf, sizeof(retbuf), 1, file);
	Assert("4,2", retbuf, "case7", __LINE__);

	fin();
}

/*
    AAAA
PPPPNNNNXXXX
might become
PPPPNNNNNNNN 8
*/
void case8(void)
{
	init();
	sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);

	mem_bind(0, 2);
	mem_interleave(4, 2);
	mem_interleave(2, 2);

	file = popen(buf, "r");
	fread(retbuf, sizeof(retbuf), 1, file);
	Assert("2,4", retbuf, "case8", __LINE__);

	fin();
}

void case_n(void)
{
	init();
	sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);

	/* make redundant mappings [0][1234][34][7] */
	mmap(mmap_addr + pagesize*4, pagesize*2, PROT_READ|PROT_WRITE,
	     MAP_FIXED|MAP_SHARED, mapped_fd, pagesize*3);

	/* Expect to do nothing. */
	mem_unbind(2, 2);

	file = popen(buf, "r");
	fread(retbuf, sizeof(retbuf), 1, file);
	Assert("4,2", retbuf, "case_n", __LINE__);

	fin();
}

int main(int argc, char** argv)
{
	case4();
	case5();
	case6();
	case7();
	case8();
	case_n();

	return 0;
}
=============================================================

Signed-off-by: KOSAKI Motohiro <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Caspar Zhang <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: <[email protected]> [3.1.x]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
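The page-offset arithmetic behind the three-step example in this message can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the patch itself: `struct sketch_vma`, `sketch_range_pgoff()` and `SKETCH_PAGE_SHIFT` are hypothetical names, and 4 KiB pages are assumed. The point it shows is that the file offset of a policy range has to be derived from the vma that is actually being processed, starting at the later of the range start and that vma's own start.

```c
#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Cut-down stand-in for struct vm_area_struct; not the kernel type. */
struct sketch_vma {
	unsigned long vm_start;	/* first address of the mapping */
	unsigned long vm_end;	/* first address past the mapping */
	unsigned long vm_pgoff;	/* file offset of vm_start, in pages */
};

/*
 * File page offset at which a range beginning at 'start' enters 'vma'.
 * If the range began in an earlier vma, the offset is simply vm_pgoff.
 */
static unsigned long sketch_range_pgoff(const struct sketch_vma *vma,
					unsigned long start)
{
	unsigned long vmstart = start > vma->vm_start ? start : vma->vm_start;

	return vma->vm_pgoff + ((vmstart - vma->vm_start) >> SKETCH_PAGE_SHIFT);
}
```

With the first mapping `[0123]` at `0x10000..0x14000` (`vm_pgoff` 0), a range starting two pages in yields page offset 2, matching the `|23|` picture above rather than `|01|`.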
2011-12-25 | Merge branch 'pm-sleep' into pm-for-linus | Rafael J. Wysocki | 2 | -7/+3
* pm-sleep: (51 commits)
  PM: Drop generic_subsys_pm_ops
  PM / Sleep: Remove forward-only callbacks from AMBA bus type
  PM / Sleep: Remove forward-only callbacks from platform bus type
  PM: Run the driver callback directly if the subsystem one is not there
  PM / Sleep: Make pm_op() and pm_noirq_op() return callback pointers
  PM / Sleep: Merge internal functions in generic_ops.c
  PM / Sleep: Simplify generic system suspend callbacks
  PM / Hibernate: Remove deprecated hibernation snapshot ioctls
  PM / Sleep: Fix freezer failures due to racy usermodehelper_is_disabled()
  PM / Sleep: Recommend [un]lock_system_sleep() over using pm_mutex directly
  PM / Sleep: Replace mutex_[un]lock(&pm_mutex) with [un]lock_system_sleep()
  PM / Sleep: Make [un]lock_system_sleep() generic
  PM / Sleep: Use the freezer_count() functions in [un]lock_system_sleep() APIs
  PM / Freezer: Remove the "userspace only" constraint from freezer[_do_not]_count()
  PM / Hibernate: Replace unintuitive 'if' condition in kernel/power/user.c with 'else'
  Freezer / sunrpc / NFS: don't allow TASK_KILLABLE sleeps to block the freezer
  PM / Sleep: Unify diagnostic messages from device suspend/resume
  ACPI / PM: Do not save/restore NVS on Asus K54C/K54HR
  PM / Hibernate: Remove deprecated hibernation test modes
  PM / Hibernate: Thaw processes in SNAPSHOT_CREATE_IMAGE ioctl test path
  ...

Conflicts:
  kernel/kmod.c
2011-12-23 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net | David S. Miller | 11 | -45/+70
Conflicts: net/bluetooth/l2cap_core.c Just two overlapping changes, one added an initialization of a local variable, and another change added a new local variable. Signed-off-by: David S. Miller <[email protected]>
2011-12-22 | Partial revert "Basic kernel memory functionality for the Memory Controller" | Glauber Costa | 1 | -87/+6
This reverts commit e5671dfae59b165e2adfd4dfbdeab11ac8db5bda. After a follow up discussion with Michal, it was agreed it would be better to leave the kmem controller with just the tcp files, deferring the behavior of the other general memory.kmem.* files for a later time, when more caches are controlled. This is because generic kmem files are not used by tcp accounting and it is not clear how other slab caches would fit into the scheme. We are reverting the original commit so we can track the reference. Part of the patch is kept, because it was used by the later tcp code. Conflicts are shown in the bottom. init/Kconfig is removed from the revert entirely. Signed-off-by: Glauber Costa <[email protected]> Acked-by: Michal Hocko <[email protected]> CC: Kirill A. Shutemov <[email protected]> CC: Paul Menage <[email protected]> CC: Greg Thelen <[email protected]> CC: Johannes Weiner <[email protected]> CC: David S. Miller <[email protected]> Conflicts: Documentation/cgroups/memory.txt mm/memcontrol.c Signed-off-by: David S. Miller <[email protected]>
2011-12-22 | percpu: Remove irqsafe_cpu_xxx variants | Christoph Lameter | 1 | -3/+3
We simply say that regular this_cpu use must be safe regardless of preemption and interrupt state. That has no material change for x86 and s390 implementations of this_cpu operations. However, arches that do not provide their own implementation for this_cpu operations will now get code generated that disables interrupts instead of preemption. -tj: This is part of on-going percpu API cleanup. For detailed discussion of the subject, please refer to the following thread. http://thread.gmane.org/gmane.linux.kernel/1222078 Signed-off-by: Christoph Lameter <[email protected]> Signed-off-by: Tejun Heo <[email protected]> LKML-Reference: <[email protected]>
2011-12-21 | vfs: __read_cache_page should use gfp argument rather than GFP_KERNEL | Dave Kleikamp | 1 | -5/+2
lockdep reports a deadlock in jfs because a special inode's rw semaphore is taken recursively. The mapping's gfp mask is GFP_NOFS, but is not used when __read_cache_page() calls add_to_page_cache_lru(). Signed-off-by: Dave Kleikamp <[email protected]> Acked-by: Hugh Dickins <[email protected]> Acked-by: Al Viro <[email protected]> Cc: [email protected] Signed-off-by: Linus Torvalds <[email protected]>
2011-12-21 | convert 'memory' sysdev_class to a regular subsystem | Kay Sievers | 3 | -29/+29
This moves the 'memory sysdev_class' over to a regular 'memory' subsystem and converts the devices to regular devices. The sysdev drivers are implemented as subsystem interfaces now. After all sysdev classes are ported to regular driver core entities, the sysdev implementation will be entirely removed from the kernel. Signed-off-by: Kay Sievers <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2011-12-21 | Merge branch 'master' into pm-sleep | Rafael J. Wysocki | 14 | -88/+146
* master: (848 commits)
  SELinux: Fix RCU deref check warning in sel_netport_insert()
  binary_sysctl(): fix memory leak
  mm/vmalloc.c: remove static declaration of va from __get_vm_area_node
  ipmi_watchdog: restore settings when BMC reset
  oom: fix integer overflow of points in oom_badness
  memcg: keep root group unchanged if creation fails
  nilfs2: potential integer overflow in nilfs_ioctl_clean_segments()
  nilfs2: unbreak compat ioctl
  cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
  evm: prevent racing during tfm allocation
  evm: key must be set once during initialization
  mmc: vub300: fix type of firmware_rom_wait_states module parameter
  Revert "mmc: enable runtime PM by default"
  mmc: sdhci: remove "state" argument from sdhci_suspend_host
  x86, dumpstack: Fix code bytes breakage due to missing KERN_CONT
  IB/qib: Correct sense on freectxts increment and decrement
  RDMA/cma: Verify private data length
  cgroups: fix a css_set not found bug in cgroup_attach_proc
  oprofile: Fix uninitialized memory access when writing to writing to oprofilefs
  Revert "xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old kernel"
  ...

Conflicts:
  kernel/cgroup_freezer.c
2011-12-20 | mm/vmalloc.c: remove static declaration of va from __get_vm_area_node | Kautuk Consul | 1 | -1/+1
Static storage is not required for the struct vmap_area in __get_vm_area_node. Removing "static" to store this variable on the stack instead. Signed-off-by: Kautuk Consul <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-12-20 | oom: fix integer overflow of points in oom_badness | Frantisek Hrbata | 1 | -1/+1
An integer overflow will happen on 64-bit archs if the sum of a task's rss, swapents and nr_ptes exceeds (2^31)/1000. This was introduced by commit f755a04 ("oom: use pte pages in OOM score"), where the oom score computation was divided into several steps, so it is no longer computed as one expression in unsigned long (rss, swapents and nr_ptes are unsigned long) whose result, assigned to points (int), is in the range 1..1000. So there could be an int overflow while computing

    points *= 1000;

and points may become negative, meaning the oom score for a mem hog task will be one:

    if (points <= 0)
            return 1;

For example:

    [ 3366] 0 3366 35390480 24303939 5 0 0 oom01
    Out of memory: Kill process 3366 (oom01) score 1 or sacrifice child

Here the oom01 process consumes more than 24303939(rss)*4096 ~= 92GB of physical memory, but its oom score is one. In this situation the mem hog task is skipped and the oom killer kills another, most probably innocent, task with an oom score greater than one.

The points variable should be of type long instead of int to prevent the int overflow.

Signed-off-by: Frantisek Hrbata <[email protected]>
Acked-by: KOSAKI Motohiro <[email protected]>
Acked-by: Oleg Nesterov <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: <[email protected]> [2.6.36+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
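The threshold quoted in this message can be checked with a small userspace sketch. It is not kernel code: the helper names are hypothetical, `int32_t` stands in for the old `int points`, and `int64_t` stands in for `long` on a 64-bit architecture. The multiplication is done in 64 bits and compared against `INT32_MAX`, because actually overflowing a signed int is undefined behaviour in C.

```c
#include <stdint.h>

/*
 * Hypothetical helpers for illustration only.
 * Returns 1 if pages * 1000 no longer fits in a 32-bit signed int.
 */
static int scaled_points_overflow_int32(int64_t pages)
{
	return pages * 1000 > INT32_MAX;
}

/* The same multiplication carried out in 64 bits, as with "long points". */
static int64_t scaled_points_wide(int64_t pages)
{
	return pages * 1000;
}
```

For the rss value from the example, `scaled_points_overflow_int32(24303939)` reports an overflow, while the largest page count that still fits is 2147483, which is (2^31)/1000 rounded down.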
2011-12-20 | memcg: keep root group unchanged if creation fails | Hillf Danton | 1 | -2/+1
If the request is to create non-root group and we fail to meet it, we should leave the root unchanged. Signed-off-by: Hillf Danton <[email protected]> Acked-by: Hugh Dickins <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Balbir Singh <[email protected]> Cc: David Rientjes <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-12-20 | Merge branch 'memblock-kill-early_node_map' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into core/memblock | Ingo Molnar | 4 | -880/+640
2011-12-18 | writeback: balanced_rate cannot exceed write bandwidth | Wu Fengguang | 1 | -0/+5
Add an upper limit to balanced_rate according to the below inequality. This filters out some rare but huge singular points, which at least enables more readable gnuplot figures.

When there are N dd dirtiers,

    balanced_dirty_ratelimit = write_bw / N

So it holds that

    balanced_dirty_ratelimit <= write_bw

The singular points originate from dirty_rate in the below formula:

    balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate

where

    dirty_rate = (number of page dirties in the past 200ms) / 200ms

In the extreme case, if all dd tasks suddenly get blocked on something else and hence no pages are dirtied at all, dirty_rate will be 0 and balanced_dirty_ratelimit will be inf. This could happen in reality.

Note that these huge singular points are not a real threat, since they are _guaranteed_ to be filtered out by the min(balanced_dirty_ratelimit, task_ratelimit) line in bdi_update_dirty_ratelimit(). task_ratelimit is based on the number of dirty pages, which will never _suddenly_ fly away like balanced_dirty_ratelimit. So any weirdly large balanced_dirty_ratelimit will be cut down to the level of task_ratelimit.

There won't be tiny singular points though, as long as the dirty pages lie inside the dirty throttling region (above the freerun region), because there the dd tasks will be throttled by balance_dirty_pages() and won't be able to suddenly dirty many more pages than average.

Acked-by: Jan Kara <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
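The cap described in this message can be sketched as a small C function. This is a simplified model under assumptions, not the code from the patch: the function name is hypothetical, rates are plain integers in pages per second, and a zero dirty_rate is handled by returning the cap directly, which is one possible way to avoid the division by zero.

```c
#include <stdint.h>

/* Illustrative only: all names and units here are assumptions. */
static uint64_t balanced_rate_capped(uint64_t task_ratelimit,
				     uint64_t write_bw,
				     uint64_t dirty_rate)
{
	uint64_t rate;

	/* No pages dirtied in the window: the quotient would be infinite. */
	if (dirty_rate == 0)
		return write_bw;

	rate = task_ratelimit * write_bw / dirty_rate;

	/* balanced_dirty_ratelimit <= write_bw must always hold. */
	return rate > write_bw ? write_bw : rate;
}
```

With four dirtiers each running at 250 pages/s against a 1000 pages/s device, dirty_rate is 1000 and the result is 250, that is write_bw / N. If dirty_rate collapses to 50, the raw quotient of 5000 is clipped to 1000.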
2011-12-18 | writeback: do strict bdi dirty_exceeded | Wu Fengguang | 1 | -1/+1
This helps to reduce dirty throttling polls and hence CPU overheads. bdi->dirty_exceeded typically only helps when suddenly starting 100+ dd's on a disk, in which case the dd's may need to poll balance_dirty_pages() earlier than tsk->nr_dirtied_pause. CC: Jan Kara <[email protected]> CC: Peter Zijlstra <[email protected]> Signed-off-by: Wu Fengguang <[email protected]>
2011-12-18 | writeback: avoid tiny dirty poll intervals | Wu Fengguang | 1 | -1/+24
The LKP tests see a big 56% regression for the case fio_mmap_randwrite_64k. Shaohua managed to root cause it to the much smaller dirty pause times and hence much more frequent invocations of the IO-less balance_dirty_pages(). Since fio_mmap_randwrite_64k effectively contains both reads and writes, the more frequent pauses triggered more idling in the cfq IO scheduler. The solution is to increase pause time all the way up to the max 200ms in this case, which is found to restore most performance. This will help reduce CPU overheads in other cases, too. Note that I don't expect many performance critical workloads to run this access pattern: the mmap read-on-write is rather inefficient and could be avoided by doing normal write syscalls. CC: Jan Kara <[email protected]> CC: Peter Zijlstra <[email protected]> Reported-by: Li Shaohua <[email protected]> Tested-by: Li Shaohua <[email protected]> Signed-off-by: Wu Fengguang <[email protected]>
2011-12-18 | writeback: max, min and target dirty pause time | Wu Fengguang | 1 | -44/+81
Control the pause time and the call intervals to balance_dirty_pages() with three parameters:

1) max_pause, limited by bdi_dirty and MAX_PAUSE

2) the target pause time, grows with the number of dd tasks and is normally limited by max_pause/2

3) the minimal pause, set to half the target pause and is used to skip short sleeps and accumulate them into bigger ones

The typical behaviors after patch:

- if ever task_ratelimit is far below dirty_ratelimit, the pause time will remain constant at max_pause and nr_dirtied_pause will be fluctuating with task_ratelimit

- in the normal cases, nr_dirtied_pause will remain stable (keep in the same pace with dirty_ratelimit) and the pause time will be fluctuating with task_ratelimit

In summary, someone has to fluctuate with task_ratelimit, because

    task_ratelimit = nr_dirtied_pause / pause

We normally prefer a stable nr_dirtied_pause, until reaching max_pause.

The notable behavior changes are:

- in stable workloads, there will no longer be sudden big trajectory switching of nr_dirtied_pause as concerned by Peter. It will be as smooth as dirty_ratelimit and changing proportionally with it (as always, assuming bdi bandwidth does not fluctuate across 2^N lines, otherwise nr_dirtied_pause will show up in 2+ parallel trajectories)

- in the rare cases when something keeps task_ratelimit far below dirty_ratelimit, the smoothness can no longer be retained and nr_dirtied_pause will be "dancing" with task_ratelimit. This fixes a (not that destructive but still not good) bug that

      dirty_ratelimit gets brought down undesirably
      <= balanced_dirty_ratelimit is under estimated
      <= weakly executed task_ratelimit
      <= pause goes too large and gets trimmed down to max_pause
      <= nr_dirtied_pause (based on dirty_ratelimit) is set too large
      <= dirty_ratelimit being much larger than task_ratelimit

- introduce min_pause to avoid small pause sleeps

- when pause is trimmed down to max_pause, try to compensate it at the next pause time

The "refactor" type of changes are:

The max_pause equation is slightly transformed to make it slightly more efficient.

We now scale target_pause by (N * 10ms) on 2^N concurrent tasks, which is effectively equal to the original scaling max_pause by (N * 20ms) because the original code does implicit target_pause ~= max_pause / 2. Based on the same implicit ratio, target_pause starts with 10ms on 1 dd.

CC: Jan Kara <[email protected]>
CC: Peter Zijlstra <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
2011-12-18 | writeback: dirty ratelimit - think time compensation | Wu Fengguang | 1 | -4/+32
Compensate the task's think time when computing the final pause time, so that ->dirty_ratelimit can be executed accurately.

    think time := time spent outside of balance_dirty_pages()

In the rare case that the task slept longer than the 200ms period time (resulting in a negative pause time), the sleep time will be compensated in the following periods, too, if it's less than 1 second.

Accumulated errors are carefully avoided as long as the max pause area is not hit.

Pseudo code:

    period = pages_dirtied / task_ratelimit;
    think = jiffies - dirty_paused_when;
    pause = period - think;

1) normal case: period > think

    pause = period - think
    dirty_paused_when = jiffies + pause
    nr_dirtied = 0

                         period time
          |===============================>|
              think time      pause time
          |===============>|==============>|
    ------|----------------|---------------|------------------------
    dirty_paused_when   jiffies

2) no pause case: period <= think

    don't pause; reduce future pause time by:
    dirty_paused_when += period
    nr_dirtied = 0

                         period time
          |===============================>|
                              think time
          |===================================================>|
    ------|--------------------------------+-------------------|----
    dirty_paused_when                                       jiffies

Acked-by: Jan Kara <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
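The pseudo code in this message maps onto a short C model. It is a sketch under assumptions rather than the kernel implementation: `struct dirty_task` and `think_time_pause()` are hypothetical names, time is an abstract tick count passed in as `now`, `task_ratelimit` is in pages per tick and assumed non-zero, and the "less than 1 second" bound on carried-over sleep time is left out for brevity.

```c
/* Illustrative per-task state; field names follow the pseudo code. */
struct dirty_task {
	long dirty_paused_when;	/* start of the current period, in ticks */
	long nr_dirtied;	/* pages dirtied in the current period */
};

/* Returns the pause length in ticks; 0 means "do not pause". */
static long think_time_pause(struct dirty_task *t, long now,
			     long pages_dirtied, long task_ratelimit)
{
	long period = pages_dirtied / task_ratelimit;
	long think = now - t->dirty_paused_when;
	long pause = period - think;

	t->nr_dirtied = 0;

	if (pause > 0) {
		/* 1) normal case: sleep for the remainder of the period */
		t->dirty_paused_when = now + pause;
		return pause;
	}

	/* 2) no pause case: carry the overrun into the following periods */
	t->dirty_paused_when += period;
	return 0;
}
```

A task that dirtied 400 pages at 8 pages per tick has a 50-tick period. If it spent 30 ticks thinking it pauses for the remaining 20. If it spent 80 ticks thinking it does not pause, and `dirty_paused_when` only advances by 50, so the 30-tick overrun shortens the next pause.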
2011-12-18 | writeback: fix dirtied pages accounting on redirty | Wu Fengguang | 1 | -0/+19
De-account the accumulative dirty counters on page redirty.

Page redirties (very common in ext4) will introduce a mismatch between counters (a) and (b):

a) NR_DIRTIED, BDI_DIRTIED, tsk->nr_dirtied
b) NR_WRITTEN, BDI_WRITTEN

This will introduce systematic errors in balanced_rate and result in dirty page position errors (i.e. the dirty pages are no longer balanced around the global/bdi setpoints).

Acked-by: Jan Kara <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
2011-12-18 | writeback: fix dirtied pages accounting on sub-page writes | Wu Fengguang | 1 | -8/+5
When dd in 512bytes, generic_perform_write() calls balance_dirty_pages_ratelimited() 8 times for the same page, but obviously the page is only dirtied once. Fix it by accounting tsk->nr_dirtied and bdp_ratelimits at page dirty time. Acked-by: Jan Kara <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Signed-off-by: Wu Fengguang <[email protected]>
2011-12-18 | writeback: charge leaked page dirties to active tasks | Wu Fengguang | 1 | -0/+27
It's a years long problem that a large number of short-lived dirtiers (eg. gcc instances in a fast kernel build) may starve long-run dirtiers (eg. dd) as well as pushing the dirty pages to the global hard limit. The solution is to charge the pages dirtied by the exited gcc to the other random dirtying tasks. It sounds not perfect, however should behave good enough in practice, seeing as that throttled tasks aren't actually running so those that are running are more likely to pick it up and get throttled, therefore promoting an equal spread. Randy: fix compile error: 'dirty_throttle_leaks' undeclared in exit.c Acked-by: Jan Kara <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Signed-off-by: Randy Dunlap <[email protected]> Signed-off-by: Wu Fengguang <[email protected]>
2011-12-15 | percpu: fix per_cpu_ptr_to_phys() handling of non-page-aligned addresses | Eugene Surovegin | 1 | -2/+4
per_cpu_ptr_to_phys() incorrectly rounds its result down to the page boundary in the non-kmalloc case, which is bogus for any non-page-aligned address.

This affects the only in-tree user of this function - the sysfs handler for the per-cpu 'crash_notes' physical address. The trouble is that the crash_notes per-cpu variable is not page-aligned:

crash_notes = 0xc08e8ed4
PER-CPU OFFSET VALUES:
 CPU 0: 3711f000
 CPU 1: 37129000
 CPU 2: 37133000
 CPU 3: 3713d000

So, the per-cpu addresses are:
 crash_notes on CPU 0: f7a07ed4 => phys 36b57ed4
 crash_notes on CPU 1: f7a11ed4 => phys 36b4ded4
 crash_notes on CPU 2: f7a1bed4 => phys 36b43ed4
 crash_notes on CPU 3: f7a25ed4 => phys 36b39ed4

However, /sys/devices/system/cpu/cpu*/crash_notes says:
 /sys/devices/system/cpu/cpu0/crash_notes: 36b57000
 /sys/devices/system/cpu/cpu1/crash_notes: 36b4d000
 /sys/devices/system/cpu/cpu2/crash_notes: 36b43000
 /sys/devices/system/cpu/cpu3/crash_notes: 36b39000

As you can see, all values are rounded down to a page boundary. Consequently, this is where kexec sets up the NOTE segments, and thus where the secondary kernel is looking for them. However, when the first kernel crashes, it saves the notes to the unaligned addresses, where they are not found.

Fix it by adding offset_in_page() to the translated page address.

-tj: Combined Eugene's and Petr's commit messages.

Signed-off-by: Eugene Surovegin <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Petr Tesarik <[email protected]>
Cc: [email protected]
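The address arithmetic of the fix can be reproduced with the numbers from this message. The sketch below is a userspace illustration, not the kernel function: the `sketch_` names are hypothetical, 4 KiB pages and 32-bit addresses are assumed (as in the example output), and `page_phys` stands for the page-aligned physical address that the page translation yields.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/* Byte offset of addr within its page, in the spirit of offset_in_page(). */
static uint32_t sketch_offset_in_page(uint32_t addr)
{
	return addr & (SKETCH_PAGE_SIZE - 1);
}

/* Page-aligned physical address plus the offset that used to be dropped. */
static uint32_t sketch_ptr_to_phys(uint32_t page_phys, uint32_t vaddr)
{
	return page_phys + sketch_offset_in_page(vaddr);
}
```

For CPU 0, the virtual address `0xf7a07ed4` has an in-page offset of `0xed4`, so adding it to the page address `0x36b57000` gives `0x36b57ed4`, the value listed above as the correct physical address.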
2011-12-13 | Merge branch 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux | Linus Torvalds | 2 | -6/+32
* 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
  writeback: set max_pause to lowest value on zero bdi_dirty
  writeback: permit through good bdi even when global dirty exceeded
  writeback: comment on the bdi dirty threshold
  fs: Make write(2) interruptible by a fatal signal
  writeback: Fix issue on make htmldocs
2011-12-13 | slub: add missed accounting | Shaohua Li | 1 | -2/+5
With the per-cpu partial list, a slab is added to the partial list first and then moved to the node list. The __slab_free() code path for add/remove_partial is almost deprecated (except for slub debug). But we forgot to account add/remove_partial when moving per-cpu partial pages to the node list, so the statistics for such events are always 0. Add the corresponding accounting. This is against the patch "slub: use correct parameter to add a page to partial list tail". Acked-by: Christoph Lameter <[email protected]> Signed-off-by: Shaohua Li <[email protected]> Signed-off-by: Pekka Enberg <[email protected]>
2011-12-13 | slub: Extract get_freelist from __slab_alloc | Christoph Lameter | 1 | -25/+32
get_freelist retrieves free objects from the page freelist (put there by remote frees) or deactivates a slab page if no more objects are available. Acked-by: David Rientjes <[email protected]> Signed-off-by: Christoph Lameter <[email protected]> Signed-off-by: Pekka Enberg <[email protected]>
2011-12-13 | slub: Switch per cpu partial page support off for debugging | Christoph Lameter | 1 | -1/+3
Eric saw an issue with accounting of slabs during validation. It's not possible to determine accurately how many per cpu partial slabs exist at any time, so this switches off per cpu partial pages during debug. Acked-by: Eric Dumazet <[email protected]> Signed-off-by: Christoph Lameter <[email protected]> Signed-off-by: Pekka Enberg <[email protected]>
2011-12-13 | slub: fix a possible memleak in __slab_alloc() | Eric Dumazet | 1 | -0/+5
Zhihua Che reported a possible memleak in the slub allocator on CONFIG_PREEMPT=y builds. It is possible that the current thread migrates right before disabling irqs in __slab_alloc(). We must check c->freelist again, and perform a normal allocation instead of scratching c->freelist. Many thanks to Zhihua Che for spotting this bug, introduced in 2.6.39. V2: It's also possible that an IRQ freed one (or several) object(s) and populated c->freelist, so it's not a CONFIG_PREEMPT-only problem. Cc: <[email protected]> [2.6.39+] Reported-by: Zhihua Che <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Acked-by: Christoph Lameter <[email protected]> Signed-off-by: Pekka Enberg <[email protected]>
2011-12-12 | cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach() | Tejun Heo | 1 | -8/+8
Currently, there's no way to pass multiple tasks to cgroup_subsys methods necessitating the need for separate per-process and per-task methods. This patch introduces cgroup_taskset which can be used to pass multiple tasks and their associated cgroups to cgroup_subsys methods. Three methods - can_attach(), cancel_attach() and attach() - are converted to use cgroup_taskset. This unifies passed parameters so that all methods have access to all information. Conversions in this patchset are identical and don't introduce any behavior change. -v2: documentation updated as per Paul Menage's suggestion. Signed-off-by: Tejun Heo <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Acked-by: Paul Menage <[email protected]> Acked-by: Li Zefan <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Daisuke Nishimura <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: James Morris <[email protected]>
2011-12-12 | tcp memory pressure controls | Glauber Costa | 1 | -1/+39
This patch introduces memory pressure controls for the tcp protocol. It uses the generic socket memory pressure code introduced in earlier patches, and fills in the necessary data in cg_proto struct. Signed-off-by: Glauber Costa <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> CC: Eric W. Biederman <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2011-12-12socket: initial cgroup code.Glauber Costa1-2/+44
The goal of this work is to move the memory pressure tcp controls to a cgroup, instead of just relying on global conditions. To avoid excessive overhead in the network fast paths, the code that accounts allocated memory to a cgroup is hidden inside a static_branch(). This branch is patched out until the first non-root cgroup is created. So when nobody is using cgroups, even if it is mounted, no significant performance penalty should be seen. This patch handles the generic part of the code, and has nothing tcp-specific. Signed-off-by: Glauber Costa <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> CC: Kirill A. Shutemov <[email protected]> CC: David S. Miller <[email protected]> CC: Eric W. Biederman <[email protected]> CC: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2011-12-12Basic kernel memory functionality for the Memory ControllerGlauber Costa1-5/+100
This patch lays down the foundation for the kernel memory component of the Memory Controller. As of today, I am only laying down the following files:
* memory.independent_kmem_limit
* memory.kmem.limit_in_bytes (currently ignored)
* memory.kmem.usage_in_bytes (always zero)
Signed-off-by: Glauber Costa <[email protected]> CC: Kirill A. Shutemov <[email protected]> CC: Paul Menage <[email protected]> CC: Greg Thelen <[email protected]> CC: Johannes Weiner <[email protected]> CC: Michal Hocko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2011-12-09mm: vmalloc: check for page allocation failure before vmlist insertionMel Gorman1-0/+2
Commit f5252e00 ("mm: avoid null pointer access in vm_struct via /proc/vmallocinfo") adds newly allocated vm_structs to the vmlist after they are fully initialised. Unfortunately, it did not check that __vmalloc_area_node() successfully populated the area. In the event of allocation failure, the vmalloc area is freed but the pointer to freed memory is inserted into the vmlist, leading to a crash later in get_vmalloc_info(). This patch adds a check for __vmalloc_area_node() failure within __vmalloc_node_range(). It does not use "goto fail" as in the previous error path as a warning was already displayed by __vmalloc_area_node() before it called vfree in its failure path. Credit goes to Luciano Chavez for doing all the real work of identifying exactly where the problem was. Signed-off-by: Mel Gorman <[email protected]> Reported-by: Luciano Chavez <[email protected]> Tested-by: Luciano Chavez <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: <[email protected]> [3.1.x+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
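The ordering problem described above can be illustrated with a small userspace sketch. It is not the kernel code: `struct area`, `area_list`, `populate_area()` and `alloc_range()` are hypothetical stand-ins for vm_struct, vmlist, __vmalloc_area_node() and __vmalloc_node_range(), and the failure is simulated with a flag. The point it shows is only that the populate step frees the area on failure, so the caller must check the result before publishing the area on the list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for vm_struct: only the fields the sketch needs. */
struct area {
    void *pages;
    struct area *next;
};

/* Hypothetical stand-in for the global vmlist. */
static struct area *area_list;

/*
 * Stand-in for __vmalloc_area_node(): on failure it releases the area
 * itself and returns NULL, so the caller must not touch 'a' afterwards.
 */
static void *populate_area(struct area *a, size_t size, int simulate_failure)
{
    if (simulate_failure) {
        free(a);
        return NULL;
    }
    a->pages = malloc(size);
    if (!a->pages) {
        free(a);
        return NULL;
    }
    return a->pages;
}

/* Stand-in for __vmalloc_node_range() with the added failure check. */
static void *alloc_range(size_t size, int simulate_failure)
{
    struct area *a;
    void *addr;

    assert(size > 0);
    a = calloc(1, sizeof(*a));
    if (!a)
        return NULL;

    addr = populate_area(a, size, simulate_failure);
    if (!addr)
        return NULL;    /* the check: never publish an already-freed area */

    /* Only a fully populated area becomes visible to list walkers. */
    a->next = area_list;
    area_list = a;
    return addr;
}

/* Walk the list the way a reader such as get_vmalloc_info() would. */
static size_t area_count(void)
{
    size_t n = 0;
    struct area *a;

    for (a = area_list; a; a = a->next)
        n++;
    return n;
}
```

Without the early return in `alloc_range()`, the freed `a` would be linked into `area_list` and any later walk would dereference freed memory.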
2011-12-09mm: Ensure that pfn_valid() is called once per pageblock when reserving pageblocksMichal Hocko1-1/+7
setup_zone_migrate_reserve() expects that zone->start_pfn starts at a pageblock_nr_pages-aligned pfn; otherwise we could access beyond an existing memblock, resulting in the following panic if CONFIG_HOLES_IN_ZONE is not configured and we do not check pfn_valid: IP: [<c02d331d>] setup_zone_migrate_reserve+0xcd/0x180 *pdpt = 0000000000000000 *pde = f000ff53f000ff53 Oops: 0000 [#1] SMP Pid: 1, comm: swapper Not tainted 3.0.7-0.7-pae #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform EIP: 0060:[<c02d331d>] EFLAGS: 00010006 CPU: 0 EIP is at setup_zone_migrate_reserve+0xcd/0x180 EAX: 000c0000 EBX: f5801fc0 ECX: 000c0000 EDX: 00000000 ESI: 000c01fe EDI: 000c01fe EBP: 00140000 ESP: f2475f58 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 Process swapper (pid: 1, ti=f2474000 task=f2472cd0 task.ti=f2474000) Call Trace: [<c02d389c>] __setup_per_zone_wmarks+0xec/0x160 [<c02d3a1f>] setup_per_zone_wmarks+0xf/0x20 [<c08a771c>] init_per_zone_wmark_min+0x27/0x86 [<c020111b>] do_one_initcall+0x2b/0x160 [<c086639d>] kernel_init+0xbe/0x157 [<c05cae26>] kernel_thread_helper+0x6/0xd Code: a5 39 f5 89 f7 0f 46 fd 39 cf 76 40 8b 03 f6 c4 08 74 32 eb 91 90 89 c8 c1 e8 0e 0f be 80 80 2f 86 c0 8b 14 85 60 2f 86 c0 89 c8 <2b> 82 b4 12 00 00 c1 e0 05 03 82 ac 12 00 00 8b 00 f6 c4 08 0f EIP: [<c02d331d>] setup_zone_migrate_reserve+0xcd/0x180 SS:ESP 0068:f2475f58 CR2: 00000000000012b4 We crashed in pageblock_is_reserved() when accessing pfn 0xc0000 because highstart_pfn = 0x36ffe. The issue was introduced in 3.0-rc1 by 6d3163ce ("mm: check if any page in a pageblock is reserved before marking it MIGRATE_RESERVE"). Make sure that start_pfn is always aligned to pageblock_nr_pages to ensure that pfn_valid is always called at the start of each pageblock. Architectures with holes in pageblocks will be correctly handled by pfn_valid_within in pageblock_is_reserved.
Signed-off-by: Michal Hocko <[email protected]> Signed-off-by: Mel Gorman <[email protected]> Tested-by: Dang Bo <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: David Rientjes <[email protected]> Cc: Arve Hjønnevåg <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: John Stultz <[email protected]> Cc: Dave Hansen <[email protected]> Cc: <[email protected]> [3.0+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
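The alignment requirement in the commit above can be sketched in plain C. This is an illustration only, not the patch itself: `PAGEBLOCK_NR_PAGES` is an assumed example value (the real pageblock_nr_pages depends on the kernel configuration), and `pageblock_align_up()` and `count_pageblocks()` are hypothetical helper names. The sketch rounds an unaligned start pfn up to the next pageblock boundary so that a loop stepping by one pageblock always lands on pageblock starts.

```c
#include <assert.h>

/* Assumed example value; must be a power of two for the mask trick below. */
#define PAGEBLOCK_NR_PAGES 1024UL

/* Round pfn up to the next multiple of PAGEBLOCK_NR_PAGES. */
static unsigned long pageblock_align_up(unsigned long pfn)
{
    /* The mask form is only valid for power-of-two block sizes. */
    assert((PAGEBLOCK_NR_PAGES & (PAGEBLOCK_NR_PAGES - 1)) == 0);
    return (pfn + PAGEBLOCK_NR_PAGES - 1) & ~(PAGEBLOCK_NR_PAGES - 1);
}

/*
 * Count the pageblock start pfns visited in [start_pfn, end_pfn).
 * Because the walk begins at an aligned pfn, every iteration begins
 * exactly at a pageblock boundary, which is where a per-pageblock
 * validity check would be made.
 */
static unsigned long count_pageblocks(unsigned long start_pfn,
                                      unsigned long end_pfn)
{
    unsigned long pfn, n = 0;

    for (pfn = pageblock_align_up(start_pfn); pfn < end_pfn;
         pfn += PAGEBLOCK_NR_PAGES)
        n++;
    return n;
}
```

With the unaligned value quoted in the commit message, 0x36ffe, and the assumed block size of 1024 pages, the walk starts at 0x37000 rather than drifting through each pageblock at offset 0x3fe.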
2011-12-09mm/migrate.c: pair unlock_page() and lock_page() when migrating huge pagesHillf Danton1-1/+1
Avoid unlocking an unlocked page if we failed to lock it. Signed-off-by: Hillf Danton <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-12-09thp: set compound tail page _count to zeroYouquan Song2-1/+2
Commit 70b50f94f1644 ("mm: thp: tail page refcounting fix") keeps all page_tail->_count zero at all times. But the current kernel does not set page_tail->_count to zero if a 1GB page is utilized. So when an IOMMU 1GB page is used by KVM, it will result in a kernel oops because a tail page's _count does not equal zero. kernel BUG at include/linux/mm.h:386! invalid opcode: 0000 [#1] SMP Call Trace: gup_pud_range+0xb8/0x19d get_user_pages_fast+0xcb/0x192 ? trace_hardirqs_off+0xd/0xf hva_to_pfn+0x119/0x2f2 gfn_to_pfn_memslot+0x2c/0x2e kvm_iommu_map_pages+0xfd/0x1c1 kvm_iommu_map_memslots+0x7c/0xbd kvm_iommu_map_guest+0xaa/0xbf kvm_vm_ioctl_assigned_device+0x2ef/0xa47 kvm_vm_ioctl+0x36c/0x3a2 do_vfs_ioctl+0x49e/0x4e4 sys_ioctl+0x5a/0x7c system_call_fastpath+0x16/0x1b RIP gup_huge_pud+0xf2/0x159 Signed-off-by: Youquan Song <[email protected]> Reviewed-by: Andrea Arcangeli <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-12-09thp: reduce khugepaged freezing latencyAndrea Arcangeli1-12/+4
khugepaged can sometimes cause suspend to fail, requiring that the user retry the suspend operation. Use wait_event_freezable_timeout() instead of schedule_timeout_interruptible() to avoid missing freezer wakeups. A try_to_freeze() would have been needed in the khugepaged_alloc_hugepage tight loop too in case of the allocation failing repeatedly, and wait_event_freezable_timeout will provide it too. khugepaged would still freeze just fine by trying again the next minute but it's better if it freezes immediately. Reported-by: Jiri Slaby <[email protected]> Signed-off-by: Andrea Arcangeli <[email protected]> Tested-by: Jiri Slaby <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: "Srivatsa S. Bhat" <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-12-09vmscan: use atomic-long for shrinker batchingKonstantin Khlebnikov1-10/+7
Use atomic-long operations instead of looping around cmpxchg(). [[email protected]: massage atomic.h inclusions] Signed-off-by: Konstantin Khlebnikov <[email protected]> Cc: Dave Chinner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
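As a rough userspace analogue of the change above, the sketch below contrasts a compare-and-swap loop with single atomic operations, using C11 <stdatomic.h> rather than the kernel's atomic_long_t API. The counter `nr_deferred` and the helper names are hypothetical, and the commit message does not say which specific atomic-long operations were used, so the choice of exchange and fetch-add here is an assumption.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical counter of deferred work; zero-initialised at file scope. */
static atomic_long nr_deferred;

/* Looping style: retry compare-and-swap until the counter is claimed. */
static long take_all_cmpxchg(atomic_long *v)
{
    long old = atomic_load(v);

    /* On failure 'old' is refreshed with the current value and we retry. */
    while (!atomic_compare_exchange_weak(v, &old, 0L))
        ;
    return old;
}

/* Single-operation style: one atomic exchange does the same job. */
static long take_all_xchg(atomic_long *v)
{
    return atomic_exchange(v, 0L);
}

/* Hand unused work back with one atomic add; returns the new total. */
static long give_back(atomic_long *v, long n)
{
    assert(n >= 0);
    return atomic_fetch_add(v, n) + n;
}
```

Both `take_all_*` variants return the previous value and leave the counter at zero; the exchange version simply has no retry loop to reason about.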
2011-12-09vmscan: fix initial shrinker size handlingKonstantin Khlebnikov1-3/+6
A shrinker function can return -1, which means that it cannot do anything without a risk of deadlock. For example prune_super() does this if it cannot grab a superblock reference, even if nr_to_scan=0. Currently we interpret this -1 as a ULONG_MAX size shrinker and evaluate `total_scan' according to this. So the next time around this shrinker can cause really big pressure. Let's skip such shrinkers instead. Also make total_scan signed, otherwise the check (total_scan < 0) below never works. Signed-off-by: Konstantin Khlebnikov <[email protected]> Cc: Dave Chinner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
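The signedness issue described above is easy to reproduce in standalone C. The sketch below is an illustration, not the vmscan code: `shrink_fn`, `busy_shrinker()`, `idle_shrinker()`, `size_unsigned()` and `should_scan()` are hypothetical names. It shows that storing a -1 return value in an unsigned long yields ULONG_MAX, and that keeping the value signed lets the caller skip such a shrinker.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical callback type: returns an object count, or -1 for "cannot proceed". */
typedef int (*shrink_fn)(int nr_to_scan);

/* A shrinker that cannot make progress, e.g. because it could not take a reference. */
static int busy_shrinker(int nr_to_scan)
{
    (void)nr_to_scan;
    return -1;
}

/* A shrinker that reports 100 freeable objects. */
static int idle_shrinker(int nr_to_scan)
{
    (void)nr_to_scan;
    return 100;
}

/* Problematic pattern: the -1 is converted to unsigned long and becomes ULONG_MAX. */
static unsigned long size_unsigned(shrink_fn fn)
{
    unsigned long max_pass = fn(0);

    return max_pass;
}

/*
 * Safer pattern: keep the result signed and skip shrinkers that report
 * nothing to do or an error. Returns 1 and stores the size if the
 * shrinker should be scanned, 0 otherwise.
 */
static int should_scan(shrink_fn fn, long *size_out)
{
    long max_pass = fn(0);

    assert(size_out);
    if (max_pass <= 0)
        return 0;
    *size_out = max_pass;
    return 1;
}
```

The same reasoning applies to the `total_scan < 0` check mentioned in the commit message: a comparison against zero can only be true if the variable has a signed type.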
2011-12-08memblock: Reimplement memblock allocation using reverse free area iteratorTejun Heo1-146/+127
Now that all early memory information is in memblock when enabled, we can implement reverse free area iterator and use it to implement NUMA aware allocator which is then wrapped for simpler variants instead of the confusing and inefficient mending of information in separate NUMA aware allocator. Implement for_each_free_mem_range_reverse(), use it to reimplement memblock_find_in_range_node() which in turn is used by all allocators. The visible allocator interface is inconsistent and can probably use some cleanup too. Signed-off-by: Tejun Heo <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Yinghai Lu <[email protected]>
2011-12-08memblock: Kill early_node_map[]Tejun Heo2-242/+19
Now all ARCH_POPULATES_NODE_MAP archs select HAVE_MEMBLOCK_NODE_MAP - there's no user of early_node_map[] left. Kill early_node_map[] and replace ARCH_POPULATES_NODE_MAP with HAVE_MEMBLOCK_NODE_MAP. Also, relocate for_each_mem_pfn_range() and helper from mm.h to memblock.h as page_alloc.c would no longer host an alternative implementation. This change is ultimately one to one mapping and shouldn't cause any observable difference; however, after the recent changes, there are some functions which now would fit memblock.c better than page_alloc.c and dependency on HAVE_MEMBLOCK_NODE_MAP instead of HAVE_MEMBLOCK doesn't make much sense on some of them. Further cleanups for functions inside HAVE_MEMBLOCK_NODE_MAP in mm.h would be nice. -v2: Fix compile bug introduced by mis-spelling CONFIG_HAVE_MEMBLOCK_NODE_MAP to CONFIG_MEMBLOCK_HAVE_NODE_MAP in mmzone.h. Reported by Stephen Rothwell. Signed-off-by: Tejun Heo <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Tony Luck <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Chen Liqin <[email protected]> Cc: Paul Mundt <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: "H. Peter Anvin" <[email protected]>
2011-12-08memblock: Implement memblock_add_node()Tejun Heo1-7/+13
Implement memblock_add_node() which can add a new memblock memory region with specific node ID. Signed-off-by: Tejun Heo <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Yinghai Lu <[email protected]>