Age | Commit message | Author | Files | Lines
2010-08-09 | oom: sacrifice child with highest badness score for parent | David Rientjes | 1 | -11/+29
When a task is chosen for oom kill, the oom killer first attempts to sacrifice a child not sharing its parent's memory instead. Unfortunately, this often kills in a seemingly random fashion based on the ordering of the selected task's child list. Additionally, it is not guaranteed at all to free a large amount of memory that we need to prevent additional oom killing in the very near future. Instead, we now only attempt to sacrifice the worst child not sharing its parent's memory, if one exists. The worst child is indicated with the highest badness() score. This has two advantages: we kill a memory-hogging task more often, and we allow the configurable /proc/pid/oom_adj value to be considered as a factor in which child to kill. Reviewers may observe that the previous implementation would iterate through the children, attempt to kill each until one was successful, and then fall back to the parent if none were found, while the new code simply kills the most memory-hogging child or the parent. Note that the only time oom_kill_task() fails, however, is when a child does not have an mm or has a /proc/pid/oom_adj of OOM_DISABLE. badness() returns 0 for both cases, so the final oom_kill_task() will always succeed. Signed-off-by: David Rientjes <[email protected]> Acked-by: Rik van Riel <[email protected]> Acked-by: Nick Piggin <[email protected]> Acked-by: Balbir Singh <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
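A rough sketch of the selection logic described above (not the verbatim patch; it assumes the badness() helper and an uptime value as used in oom_kill.c of that era):

        struct task_struct *victim = p; /* fall back to the parent itself */
        struct task_struct *child;
        unsigned long victim_points = 0;

        list_for_each_entry(child, &p->children, sibling) {
                unsigned long child_points;

                if (child->mm == p->mm)
                        continue;       /* shares the parent's memory, skip */

                /* badness() returns 0 for unkillable children */
                child_points = badness(child, uptime);
                if (child_points > victim_points) {
                        victim = child;
                        victim_points = child_points;
                }
        }
        return oom_kill_task(victim);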
2010-08-09 | oom: filter tasks not sharing the same cpuset | David Rientjes | 1 | -8/+2
Tasks that do not share the same set of allowed nodes with the task that triggered the oom should not be considered as candidates for oom kill. Tasks in other cpusets with a disjoint set of mems would be unfairly penalized otherwise because of oom conditions elsewhere; an extreme example could unfairly kill all other applications on the system if a single task in a user's cpuset sets itself to OOM_DISABLE and then uses more memory than allowed. Killing tasks outside of current's cpuset rarely would free memory for current anyway. To use a sane heuristic, we must ensure that killing a task would likely free memory for current and avoid needlessly killing others at all costs just because their potential memory freeing is unknown. It is better to kill current than another task needlessly. Signed-off-by: David Rientjes <[email protected]> Acked-by: Rik van Riel <[email protected]> Acked-by: Nick Piggin <[email protected]> Acked-by: Balbir Singh <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
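For illustration only, the candidate filter can be expressed with the existing cpuset helper cpuset_mems_allowed_intersects(); the actual patch may differ in detail:

        /* in the select_bad_process() candidate loop */
        if (!cpuset_mems_allowed_intersects(current, p))
                continue;       /* killing p would not free memory for current */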
2010-08-09 | oom: avoid sending exiting tasks a SIGKILL | David Rientjes | 1 | -1/+1
It's unnecessary to SIGKILL a task that is already PF_EXITING and can actually cause a NULL pointer dereference of the sighand if it has already been detached. Instead, simply set TIF_MEMDIE so it has access to memory reserves and can quickly exit as the comment implies. Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: David Rientjes <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09 | oom: give current access to memory reserves if it has been killed | David Rientjes | 1 | -0/+10
It's possible to livelock the page allocator if a thread has mm->mmap_sem and fails to make forward progress because the oom killer selects another thread sharing the same ->mm to kill that cannot exit until the semaphore is dropped. The oom killer will not kill multiple tasks at the same time; each oom killed task must exit before another task may be killed. Thus, if one thread is holding mm->mmap_sem and cannot allocate memory, all threads sharing the same ->mm are blocked from exiting as well. In the oom kill case, that means the thread holding mm->mmap_sem will never free additional memory since it cannot get access to memory reserves and the thread that depends on it with access to memory reserves cannot exit because it cannot acquire the semaphore. Thus, the page allocator livelocks. When the oom killer is called and current happens to have a pending SIGKILL, this patch automatically gives it access to memory reserves and returns. Upon returning to the page allocator, its allocation will hopefully succeed so it can quickly exit and free its memory. If not, the page allocator will fail the allocation if it is not __GFP_NOFAIL. Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: David Rientjes <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
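A sketch of the early bail-out this describes, in the out_of_memory() path (simplified; assumes the usual fatal_signal_pending()/TIF_MEMDIE helpers):

        /* current is already dying: let it use the reserves and get out */
        if (fatal_signal_pending(current)) {
                set_thread_flag(TIF_MEMDIE);
                return;
        }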
2010-08-09 | oom: dump_tasks use find_lock_task_mm too fix | David Rientjes | 1 | -2/+2
When find_lock_task_mm() returns a thread other than p in dump_tasks(), its name should be displayed instead. This is the thread that will be targeted by the oom killer, not its mm-less parent. This also allows us to safely dereference task->comm without needing get_task_comm(). While we're here, remove the cast on task_cpu(task) as Andrew suggested. Signed-off-by: David Rientjes <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Oleg Nesterov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09 | oom: improve commentary in dump_tasks() | David Rientjes | 1 | -8/+3
The comments in dump_tasks() should be updated to be more clear about why tasks are filtered and how they are filtered by its argument. An unnecessary comment concerning a check for is_global_init() is removed since it isn't of importance. Suggested-by: Andrew Morton <[email protected]> Signed-off-by: David Rientjes <[email protected]> Acked-by: KOSAKI Motohiro <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Oleg Nesterov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09 | oom: dump_tasks use find_lock_task_mm too | KOSAKI Motohiro | 1 | -18/+21
dump_task() should use find_lock_task_mm() too. It is necessary to protect against the task-exiting race. dump_tasks() currently filters any task that does not have an attached ->mm since it incorrectly assumes that it must either be in the process of exiting and has detached its memory or that it's a kernel thread; multithreaded tasks may actually have subthreads that have a valid ->mm pointer and thus those threads should actually be displayed. This change finds those threads, if they exist, and emits their information along with the rest of the candidate tasks for kill. Signed-off-by: KOSAKI Motohiro <[email protected]> Signed-off-by: David Rientjes <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Oleg Nesterov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09 | oom: introduce find_lock_task_mm() to fix !mm false positives | Oleg Nesterov | 1 | -31/+43
Almost all ->mm == NULL checks in oom_kill.c are wrong. The current code assumes that the task without ->mm has already released its memory and ignores the process. However this is not necessarily true when this process is multithreaded: other live sub-threads can use this ->mm.

- Remove the "if (!p->mm)" check in select_bad_process(), it is just wrong.
- Add the new helper, find_lock_task_mm(), which finds the live thread which uses the memory and takes task_lock() to pin ->mm.
- Change oom_badness() to use this helper instead of just checking ->mm != NULL.
- As David pointed out, select_bad_process() must never choose the task without ->mm, but no matter what oom_badness() returns the task can be chosen if nothing else has been found yet. Change oom_badness() to return int, change it to return -1 if find_lock_task_mm() fails, and change select_bad_process() to check points >= 0.

Note! This patch is not enough, we need more changes.

- oom_badness() was fixed, but oom_kill_task() still ignores the task without ->mm
- oom_forkbomb_penalty() should use find_lock_task_mm() too, and it also needs other changes to actually find the first-descendant children

This will be addressed later.

[[email protected]: use in badness(), __oom_kill_task()]
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
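A rough sketch of what such a helper can look like (simplified, not necessarily the exact code added here): walk all threads of the process, take task_lock() on the first one that still has an ->mm, and return it with the lock held.

        struct task_struct *find_lock_task_mm(struct task_struct *p)
        {
                struct task_struct *t = p;

                do {
                        task_lock(t);
                        if (likely(t->mm))
                                return t;       /* returned with task_lock(t) held */
                        task_unlock(t);
                } while_each_thread(p, t);

                return NULL;
        }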
2010-08-09 | oom: PF_EXITING check should take mm into account | Oleg Nesterov | 1 | -1/+1
select_bad_process() checks PF_EXITING to detect the task which is going to release its memory, but the logic is very wrong.

- a single process P with the dead group leader disables select_bad_process() completely, it will always return ERR_PTR() while P can live forever
- if the PF_EXITING task has already released its ->mm it doesn't make sense to expect it is going to free more memory (except task_struct/etc)

Change the code to ignore the PF_EXITING tasks without ->mm.

Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
Cc: Balbir Singh <[email protected]>
Acked-by: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09 | oom: check PF_KTHREAD instead of !mm to skip kthreads | Oleg Nesterov | 1 | -6/+3
select_bad_process() thinks a kernel thread can't have ->mm != NULL, but this is not true due to use_mm(). Change the code to check PF_KTHREAD. Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Signed-off-by: David Rientjes <[email protected]> Acked-by: KOSAKI Motohiro <[email protected]> Cc: Balbir Singh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09 | buffer_head: remove redundant test from wait_on_buffer | Richard Kennedy | 1 | -6/+1
The comment suggests that when b_count equals zero it is calling __wait_on_buffer to trigger some debug, but as there is no debug in __wait_on_buffer the whole thing is redundant. AFAICT from the git log this has been the case for at least 5 years, so it seems safe just to remove this. Signed-off-by: Richard Kennedy <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Jeff Mahoney <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09 | mm: extend KSM refcounts to the anon_vma root | Rik van Riel | 4 | -19/+69
KSM reference counts can cause an anon_vma to exist after the process it belongs to has already exited. Because the anon_vma lock now lives in the root anon_vma, we need to ensure that the root anon_vma stays around until after all the "child" anon_vmas have been freed. The obvious way to do this is to have a "child" anon_vma take a reference to the root in anon_vma_fork. When the anon_vma is freed at munmap or process exit, we drop the refcount in anon_vma_unlink and possibly free the root anon_vma. The KSM anon_vma reference count function also needs to be modified to deal with the possibility of freeing 2 levels of anon_vma. The easiest way to do this is to break out the KSM magic and make it generic. When compiling without CONFIG_KSM, this code is compiled out. Signed-off-by: Rik van Riel <[email protected]> Tested-by: Larry Woodman <[email protected]> Acked-by: Larry Woodman <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: Linus Torvalds <[email protected]> Tested-by: Dave Young <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09 | mm: always lock the root (oldest) anon_vma | Rik van Riel | 4 | -14/+28
Always (and only) lock the root (oldest) anon_vma whenever we do something in an anon_vma. The recently introduced anon_vma scalability is due to the rmap code scanning only the VMAs that need to be scanned. Many common operations still took the anon_vma lock on the root anon_vma, so always taking that lock is not expected to introduce any scalability issues. However, always taking the same lock does mean we only need to take one lock, which means rmap_walk on pages from any anon_vma in the vma is excluded from occurring during an munmap, expand_stack or other operation that needs to exclude rmap_walk and similar functions. Also add the proper locking to vma_adjust. Signed-off-by: Rik van Riel <[email protected]> Tested-by: Larry Woodman <[email protected]> Acked-by: Larry Woodman <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: Linus Torvalds <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
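Roughly, "always lock the root" funnels all locking through the root anon_vma's lock; a sketch of the lock helpers under that scheme, assuming the ->root pointer introduced by the "mm: track the root (oldest) anon_vma" change listed below:

        static inline void anon_vma_lock(struct anon_vma *anon_vma)
        {
                spin_lock(&anon_vma->root->lock);
        }

        static inline void anon_vma_unlock(struct anon_vma *anon_vma)
        {
                spin_unlock(&anon_vma->root->lock);
        }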
2010-08-09 | mm: track the root (oldest) anon_vma | Rik van Riel | 2 | -2/+17
Track the root (oldest) anon_vma in each anon_vma tree. Because we only take the lock on the root anon_vma, we cannot use the lock on higher-up anon_vmas to lock anything. This makes it impossible to do an indirect lookup of the root anon_vma, since the data structures could go away from under us. However, a direct pointer is safe because the root anon_vma is always the last one that gets freed on munmap or exit, by virtue of the same_vma list order and unlink_anon_vmas walking the list forward. [[email protected]: fix typo] Signed-off-by: Rik van Riel <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Tested-by: Larry Woodman <[email protected]> Acked-by: Larry Woodman <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Acked-by: Linus Torvalds <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09 | mm: change direct call of spin_lock(anon_vma->lock) to inline function | Rik van Riel | 5 | -21/+31
Substitute a direct call of spin_lock(anon_vma->lock) with an inline function doing exactly the same. This makes it easier to do the substitution to the root anon_vma lock in a following patch. We will deal with the handful of special locks (nested, dec_and_lock, etc) separately. Signed-off-by: Rik van Riel <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Tested-by: Larry Woodman <[email protected]> Acked-by: Larry Woodman <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Acked-by: Linus Torvalds <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09 | mm: rename anon_vma_lock to vma_lock_anon_vma | Rik van Riel | 2 | -9/+9
Rename anon_vma_lock to vma_lock_anon_vma. This matches the naming style used in page_lock_anon_vma and will come in really handy further down in this patch series. Signed-off-by: Rik van Riel <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Tested-by: Larry Woodman <[email protected]> Acked-by: Larry Woodman <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Acked-by: Linus Torvalds <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09 | kmap_atomic: make kunmap_atomic() harder to misuse | Cesar Eduardo Barros | 14 | -20/+28
kunmap_atomic() is currently at level -4 on Rusty's "Hard To Misuse" list[1] ("Follow common convention and you'll get it wrong"), except in some architectures when CONFIG_DEBUG_HIGHMEM is set[2][3]. kunmap() takes a pointer to a struct page; kunmap_atomic(), however, takes a pointer to within the page itself. This seems to once in a while trip people up (the convention they are following is the one from kunmap()).

Make it much harder to misuse, by moving it to level 9 on Rusty's list[4] ("The compiler/linker won't let you get it wrong"). This is done by refusing to build if the type of its first argument is a pointer to a struct page. The real kunmap_atomic() is renamed to kunmap_atomic_notypecheck() (which is what you would call in case for some strange reason calling it with a pointer to a struct page is not incorrect in your code).

The previous version of this patch was compile tested on x86-64.

[1] http://ozlabs.org/~rusty/index.cgi/tech/2008-04-01.html
[2] In these cases, it is at level 5, "Do it right or it will always break at runtime."
[3] At least mips and powerpc look very similar, and sparc also seems to share a common ancestor with both; there seems to be quite some degree of copy-and-paste coding here. The include/asm/highmem.h file for these three archs mention x86 CPUs at its top.
[4] http://ozlabs.org/~rusty/index.cgi/tech/2008-03-30.html
[5] As an aside, could someone tell me why mn10300 uses unsigned long as the first parameter of kunmap_atomic() instead of void *?

Signed-off-by: Cesar Eduardo Barros <[email protected]>
Cc: Russell King <[email protected]> (arch/arm)
Cc: Ralf Baechle <[email protected]> (arch/mips)
Cc: David Howells <[email protected]> (arch/frv, arch/mn10300)
Cc: Koichi Yasutake <[email protected]> (arch/mn10300)
Cc: Kyle McMartin <[email protected]> (arch/parisc)
Cc: Helge Deller <[email protected]> (arch/parisc)
Cc: "James E.J. Bottomley" <[email protected]> (arch/parisc)
Cc: Benjamin Herrenschmidt <[email protected]> (arch/powerpc)
Cc: Paul Mackerras <[email protected]> (arch/powerpc)
Cc: "David S. Miller" <[email protected]> (arch/sparc)
Cc: Thomas Gleixner <[email protected]> (arch/x86)
Cc: Ingo Molnar <[email protected]> (arch/x86)
Cc: "H. Peter Anvin" <[email protected]> (arch/x86)
Cc: Arnd Bergmann <[email protected]> (include/asm-generic)
Cc: Rusty Russell <[email protected]> ("Hard To Misuse" list)
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
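One way to make the build refuse a struct page pointer is a wrapper macro built on BUILD_BUG_ON() and __same_type(); a sketch of that idea, assuming the renamed kunmap_atomic_notypecheck() described above:

        #define kunmap_atomic(addr, idx) do { \
                        BUILD_BUG_ON(__same_type((addr), struct page *)); \
                        kunmap_atomic_notypecheck((addr), (idx)); \
                } while (0)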
2010-08-09 | hugetlb: call mmu notifiers on hugepage cow | Doug Doan | 1 | -0/+6
When a copy-on-write occurs, we take one of two paths in handle_mm_fault: through handle_pte_fault for normal pages, or through hugetlb_fault for huge pages. In the normal page case, we eventually get to do_wp_page and call mmu notifiers via ptep_clear_flush_notify. There is no callout to the mmu notifiers in the huge page case. This patch fixes that. Signed-off-by: Doug Doan <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09 | mm: provide init_mm mm_context initializer | Heiko Carstens | 3 | -4/+11
Provide an INIT_MM_CONTEXT initializer macro which can be used to statically initialize mm_struct:mm_context of init_mm. This way we can get rid of code which will do the initialization at run time (on s390). In addition the current code can be found at a place where it is not expected. So let's have a common initializer which architectures can use if needed. This is based on a patch from Suzuki Poulose. Signed-off-by: Heiko Carstens <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Suzuki Poulose <[email protected]> Cc: Alexey Dobriyan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09 | mm: use ERR_CAST | Julia Lawall | 1 | -1/+1
Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes the purpose of the operation clearer; the latter otherwise looks like a no-op.

The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
type T;
T x;
identifier f;
@@

T f (...) { <+...
- ERR_PTR(PTR_ERR(x))
+ x
 ...+> }

@@
expression x;
@@

- ERR_PTR(PTR_ERR(x))
+ ERR_CAST(x)
// </smpl>

Signed-off-by: Julia Lawall <[email protected]>
Cc: Nick Piggin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
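For illustration, the transformation amounts to something like the following hypothetical snippet (foo_get_root_inode() is made up; the point is converting an error pointer from one type to another without the ERR_PTR(PTR_ERR()) round trip):

        static struct dentry *foo_lookup_root(struct super_block *sb)
        {
                struct inode *inode = foo_get_root_inode(sb);   /* hypothetical helper */

                if (IS_ERR(inode))
                        return ERR_CAST(inode); /* was: ERR_PTR(PTR_ERR(inode)) */
                return d_obtain_alias(inode);
        }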
2010-08-09 | mm: use memdup_user | Julia Lawall | 1 | -8/+3
Use memdup_user when user data is immediately copied into the allocated region.

The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
expression from,to,size,flag;
position p;
identifier l1,l2;
@@

-  to = \(kmalloc@p\|kzalloc@p\)(size,flag);
+  to = memdup_user(from,size);
   if (
-      to==NULL
+      IS_ERR(to)
              || ...) {
   <+... when != goto l1;
-  -ENOMEM
+  PTR_ERR(to)
   ...+>
   }
-  if (copy_from_user(to, from, size) != 0) {
-    <+... when != goto l2;
-    -EFAULT
-    ...+>
-  }
// </smpl>

Signed-off-by: Julia Lawall <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
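The before/after pattern this replaces looks roughly like the following sketch (buffer and length names are hypothetical):

        /* before */
        buf = kmalloc(len, GFP_KERNEL);
        if (!buf)
                return -ENOMEM;
        if (copy_from_user(buf, ubuf, len)) {
                kfree(buf);
                return -EFAULT;
        }

        /* after */
        buf = memdup_user(ubuf, len);
        if (IS_ERR(buf))
                return PTR_ERR(buf);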
2010-08-09 | asm-generic: use raw_local_irq_save/restore instead local_irq_save/restore | Michal Simek | 1 | -6/+6
The start/stop_critical_timing functions for the preemptirqsoff, preemptoff and irqsoff tracers contain atomic_inc() and atomic_dec() operations. Atomic operations use the local_irq_save/restore macros to ensure atomic access, but they are traced by the same function, which causes a recursion problem. The reason is that when these tracers are turned on, the local_irq_save/restore macros in include/linux/irqflags.h are changed to call trace_hardirqs_on/off, which call start/stop_critical_timing. Microblaze was affected because it uses the generic atomic implementation. Signed-off-by: Michal Simek <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09 | drivers/video/w100fb.c: ignore void return value / fix build failure | Peter Huewe | 1 | -2/+2
Fix a build failure "error: void value not ignored as it ought to be" by removing an assignment of a void return value. The functionality of the code is not changed. Signed-off-by: Peter Huewe <[email protected]> Acked-by: Henrik Kretzschmar <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09 | ipmi: fix ACPI detection with regspacing | Yinghai Lu | 1 | -0/+8
After the commit that changed the ipmi_si detection sequence from SMBIOS/ACPI to ACPI/SMBIOS,

| commit 754d453185275951d39792865927ec494fa1ebd8
| Author: Matthew Garrett <[email protected]>
| Date: Wed May 26 14:43:47 2010 -0700
|
| ipmi: change device discovery order
|
| The ipmi spec provides an ordering for si discovery. Change the driver to
| match, with the exception of preferring smbios to SPMI as HPs (at least)
| contain accurate information in the former but not the latter.

ipmi_si cannot be initialized:

[ 138.799739] calling init_ipmi_devintf+0x0/0x109 @ 1
[ 138.805050] ipmi device interface
[ 138.818131] initcall init_ipmi_devintf+0x0/0x109 returned 0 after 12797 usecs
[ 138.822998] calling init_ipmi_si+0x0/0xa90 @ 1
[ 138.840276] IPMI System Interface driver.
[ 138.846137] ipmi_si: probing via ACPI
[ 138.849225] ipmi_si 00:09: [io 0x0ca2] regsize 1 spacing 1 irq 0
[ 138.864438] ipmi_si: Adding ACPI-specified kcs state machine
[ 138.870893] ipmi_si: probing via SMBIOS
[ 138.880945] ipmi_si: Adding SMBIOS-specified kcs state machineipmi_si: duplicate interface
[ 138.896511] ipmi_si: probing via SPMI
[ 138.899861] ipmi_si: Adding SPMI-specified kcs state machineipmi_si: duplicate interface
[ 138.917095] ipmi_si: Trying ACPI-specified kcs state machine at i/o address 0xca2, slave address 0x0, irq 0
[ 138.928658] ipmi_si: Interface detection failed
[ 138.953411] initcall init_ipmi_si+0x0/0xa90 returned 0 after 110847 usecs

The SMBIOS (DMI) data has:

Handle 0x00C5, DMI type 38, 18 bytes
IPMI Device Information
        Interface Type: KCS (Keyboard Control Style)
        Specification Version: 2.0
        I2C Slave Address: 0x00
        NV Storage Device: Not Present
        Base Address: 0x0000000000000CA2 (I/O)
        Register Spacing: 32-bit Boundaries

and the DSDT has:

Device (BMC)
{
        Name (_HID, EisaId ("IPI0001"))
        Method (_STA, 0, NotSerialized)
        {
                If (LEqual (OSN, Zero))
                {
                        Return (Zero)
                }
                Return (0x0F)
        }
        Name (_STR, Unicode ("IPMI_KCS"))
        Name (_UID, Zero)
        Name (_CRS, ResourceTemplate ()
        {
                IO (Decode16,
                        0x0CA2,  // Range Minimum
                        0x0CA2,  // Range Maximum
                        0x00,    // Alignment
                        0x01,    // Length
                        )
                IO (Decode16,
                        0x0CA6,  // Range Minimum
                        0x0CA6,  // Range Maximum
                        0x00,    // Alignment
                        0x01,    // Length
                        )
        })
        Method (_IFT, 0, NotSerialized)
        {
                Return (One)
        }
        Method (_SRV, 0, NotSerialized)
        {
                Return (0x0200)
        }
}

so the reg spacing should be 4 instead of 1. Try to calculate regspacing for this kind of system.

Observed on a Sun Fire X4800. Other OSes work and pass certification.

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Acked-by: Matthew Garrett <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Myron Stowe <[email protected]>
Cc: Corey Minyard <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq | Linus Torvalds | 2 | -14/+6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  drm: fix fallouts from slow-work -> wq conversion
  workqueue: workqueue_cpu_callback() should be cpu_notifier instead of hotcpu_notifier
  workqueue: add missing __percpu markup in kernel/workqueue.c
2010-08-10 | Merge git://git.infradead.org/users/dwmw2/libraid-2.6 into for-linus | NeilBrown | 6415 | -197539/+421380
2010-08-09 | no need for list_for_each_entry_safe()/resetting with superblock list | Al Viro | 2 | -20/+28
just delay __put_super() a bit Signed-off-by: Al Viro <[email protected]>
2010-08-09 | Fix sget() race with failing mount | Al Viro | 3 | -1/+8
If sget() finds a matching superblock being set up, it'll grab an active reference to it and grab s_umount. That's fine - we'll wait for completion of foofs_get_sb() that way. However, if said foofs_get_sb() fails we'll end up holding the halfway-created superblock. deactivate_locked_super() called by foofs_get_sb() will just unlock the sucker since we are holding another active reference to it. What we need is a way to tell if superblock has been successfully set up. Unfortunately, neither ->s_root nor the check for MS_ACTIVE quite fit. Cheap and easy way, suitable for backport: new flag set by the (only) caller of ->get_sb(). If that flag isn't present by the time sget() grabbed s_umount on preexisting superblock it has found, it's seeing a stillborn and should just bury it with deactivate_locked_super() (and repeat the search). Longer term we want to set that flag in ->get_sb() instances (and check for it to distinguish between "sget() found us a live sb" and "sget() has allocated an sb, we need to set it up" in there, instead of checking ->s_root as we do now). Signed-off-by: Al Viro <[email protected]> Cc: [email protected]
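The log deliberately does not name the new flag. Purely as an illustration of the mechanism (assuming a flag such as MS_BORN set by the caller of ->get_sb() once setup succeeds), the sget() side could look roughly like:

        /* in sget(), after taking s_umount on a matching, active superblock */
        if (!(old->s_flags & MS_BORN)) {
                /* stillborn: ->get_sb() failed before setup completed */
                deactivate_locked_super(old);
                goto retry;     /* repeat the search */
        }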
2010-08-09 | vfs: don't hold s_umount over close_bdev_exclusive() call | Tejun Heo | 1 | -0/+9
Fix an obscure AB-BA deadlock in get_sb_bdev(). When a superblock is mounted more than once get_sb_bdev() calls close_bdev_exclusive() to drop the extra bdev reference while holding s_umount. However, sb->s_umount nests inside bd_mutex during __invalidate_device() and close_bdev_exclusive() acquires bd_mutex during blkdev_put(); thus creating an AB-BA deadlock. This condition doesn't trigger frequently. For this condition to be visible to lockdep, the filesystem must occupy the whole device (as __invalidate_device() only grabs bd_mutex for the whole device), the FS must be mounted more than once and partition rescan should be issued while the FS is still mounted. Fix it by dropping s_umount over close_bdev_exclusive(). Signed-off-by: Tejun Heo <[email protected]> Reported-by: Ciprian Docan <[email protected]> Cc: Al Viro <[email protected]> Acked-by: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Al Viro <[email protected]>
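A sketch of what dropping s_umount around the call in get_sb_bdev() amounts to (simplified; error handling omitted):

        /* s_umount nests outside bd_mutex, so release it across blkdev_put() */
        up_write(&s->s_umount);
        close_bdev_exclusive(bdev, mode);
        down_write(&s->s_umount);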
2010-08-09 | sysv: do not mark superblock dirty on remount | Artem Bityutskiy | 1 | -2/+2
No need to mark the superblock as dirty in sysv_remount, synchronize it instead (only if mounting R/O). I did not find any documentation for this file-system and have no way to test my changes, so this is untested. I see other issues in sysv, e.g., why does sysv_sync_fs write only in the FSTYPE_SYSV4 case? However, it marks its SB bh's dirty for all types, and does not wait for them ever. With zero docs I'm unable to fix this. Signed-off-by: Artem Bityutskiy <[email protected]> Signed-off-by: Al Viro <[email protected]>
2010-08-09 | sysv: do not mark superblock dirty on mount | Artem Bityutskiy | 1 | -1/+0
I did not find any documentation for this file-system and have no way to test my changes, so this is untested. Signed-off-by: Artem Bityutskiy <[email protected]> Signed-off-by: Al Viro <[email protected]>
2010-08-09 | btrfs: remove junk sb_dirt change | Artem Bityutskiy | 1 | -1/+0
BTRFS does not define a '->write_super()' method, so it should not mark its superblock as dirty. This looks like some left-over. Signed-off-by: Artem Bityutskiy <[email protected]> Acked-by: Chris Mason <[email protected]> Signed-off-by: Al Viro <[email protected]>
2010-08-09 | BFS: clean up the superblock usage | Artem Bityutskiy | 3 | -43/+7
BFS is a very simple FS and its superblock contains only static information and is never changed. However, the BFS code for some mysterious reason marked its buffer head as dirty from time to time, but nothing in that buffer was ever changed. This patch removes all the BFS superblock manipulation, simply because it is not needed. It removes:

1. The si_sbh field from 'struct bfs_sb_info' because it is not needed. We only need to read the SB once on mount to get the start of data blocks and the FS size. After this, we can forget about the SB.
2. All instances of 'mark_buffer_dirty(sbh)' for the BFS SB because it is never changed.
3. The '->sync_fs()' method because there is nothing to sync (inodes are synced by VFS).
4. The '->write_super()' method, again, because the SB is never changed.

Tested-by: Artem Bityutskiy <[email protected]>
Signed-off-by: Artem Bityutskiy <[email protected]>
Signed-off-by: Al Viro <[email protected]>
2010-08-09 | AFFS: wait for sb synchronization when needed | Artem Bityutskiy | 1 | -5/+7
AFFS does not ever wait for superblock synchronization in ->put_super(), ->write_super, and ->sync_fs(). However, it should wait for synchronization in ->put_super() because it is about to be unmounted, in ->write_super() because this is periodic SB synchronization performed from a separate kernel thread, and in ->sync_fs() it should respect the 'wait' flag. This patch fixes the situation. Also, in ->put_super(), do not write the SB if it is not dirty. Tested-by: Artem Bityutskiy <[email protected]> Signed-off-by: Artem Bityutskiy <[email protected]> Signed-off-by: Al Viro <[email protected]>
2010-08-09 | AFFS: clean up dirty flag usage | Artem Bityutskiy | 1 | -14/+5
In 'affs_write_super()': remove ancient and wrong commented-out code and the unneeded 'clean' variable, so the function becomes a bit cleaner and simpler. In 'affs_remount()': remove unnecessary SB dirty flag changes. Tested-by: Artem Bityutskiy <[email protected]> Signed-off-by: Artem Bityutskiy <[email protected]> Signed-off-by: Al Viro <[email protected]>
2010-08-09 | cifs: truncate fallout | Christoph Hellwig | 1 | -23/+7
Remove the calls to inode_newsize_ok given that we already did it as part of inode_change_ok in the beginning of cifs_setattr_(no)unix. No need to call ->truncate if cifs doesn't have one, so remove the explicit call in cifs_vmtruncate, and replace the calls to vmtruncate with truncate_setsize which is vmtruncate minus inode_newsize_ok and the call to ->truncate. Rename cifs_vmtruncate to cifs_setsize to match the new calling conventions. Question 1: why does cifs do the pagecache munging and i_size update twice for each setattr call, once opencoded in cifs_vmtruncate, and once using the VFS helpers? Question 2: what is supposed to be protected by i_lock in cifs_vmtruncate? Do we need it around the call to inode_change_ok? [AV: fixed build breakage] Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Al Viro <[email protected]>
2010-08-09 | mbcache: fix shrinker function return value | Andreas Gruenbacher | 1 | -17/+10
The shrinker function is supposed to return the number of cache entries after shrinking, not before shrinking. Fix that. Based on a patch from Wang Sheng-Hui <[email protected]>. Signed-off-by: Andreas Gruenbacher <[email protected]> Signed-off-by: Al Viro <[email protected]>
2010-08-09 | mbcache: Remove unused features | Andreas Gruenbacher | 5 | -137/+60
The mbcache code was written to support a variable number of indexes, but all the existing users use exactly one index. Simplify the code to support only that case. There are also no users of the cache entry free operation, and none of the users keep extra data in cache entries. Remove those features as well. Signed-off-by: Andreas Gruenbacher <[email protected]> Signed-off-by: Al Viro <[email protected]>
2010-08-09 | add f_flags to struct statfs(64) | Christoph Hellwig | 5 | -13/+89
Add a flags field to help glibc implementing statvfs(3) efficiently. We copy the flag values from glibc, and add a new ST_VALID flag to denote that f_flags is implemented. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Al Viro <[email protected]>
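As a purely hypothetical illustration of how a statvfs(3) implementation could consume the new field (the fallback helper name is made up, and the exact struct layout on the userspace side is an assumption):

        /* trust f_flags only when the kernel marks it valid */
        if (kbuf.f_flags & ST_VALID)
                vbuf->f_flag = kbuf.f_flags & ~ST_VALID;
        else
                vbuf->f_flag = flags_from_proc_mounts(path);    /* hypothetical legacy fallback */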
2010-08-09 | pass a struct path to vfs_statfs | Christoph Hellwig | 11 | -46/+67
We'll need the path to implement the flags field for statvfs support. We do have it available in all callers except:

- ecryptfs_statfs. This one doesn't actually need vfs_statfs but just needs to call the lower filesystem's statfs method.
- sys_ustat. Add a non-exported statfs_by_dentry helper for it, which won't be able to fill out the flags field later on.

In addition rename the helpers for statfs vs fstatfs to do_*statfs instead of the misleading vfs prefix.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Al Viro <[email protected]>
2010-08-09 | update VFS documentation for method changes. | Al Viro | 2 | -10/+39
Signed-off-by: Al Viro <[email protected]>
2010-08-09 | All filesystems that need invalidate_inode_buffers() are doing that explicitly | Al Viro | 1 | -1/+0
Signed-off-by: Al Viro <[email protected]>
2010-08-09 | convert remaining ->clear_inode() to ->evict_inode() | Al Viro | 34 | -59/+94
Signed-off-by: Al Viro <[email protected]>
2010-08-09 | Make ->drop_inode() just return whether inode needs to be dropped | Al Viro | 10 | -103/+60
... and let iput_final() do the actual eviction or retention Signed-off-by: Al Viro <[email protected]>
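A sketch of the new convention (simplified): ->drop_inode() only reports whether the inode should be dropped, and iput_final() acts on the answer. The generic default can then be as small as:

        /* drop if the inode has no links left or was never hashed/unhashed already */
        int generic_drop_inode(struct inode *inode)
        {
                return !inode->i_nlink || hlist_unhashed(&inode->i_hash);
        }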
2010-08-09 | fs/inode.c:clear_inode() is gone | Al Viro | 2 | -25/+4
Signed-off-by: Al Viro <[email protected]>
2010-08-09 | fs/inode.c:evict() doesn't care about delete vs. non-delete paths now | Al Viro | 1 | -4/+4
Signed-off-by: Al Viro <[email protected]>
2010-08-09 | ->delete_inode() is gone | Al Viro | 2 | -3/+0
Signed-off-by: Al Viro <[email protected]>
2010-08-09 | convert ext4 to ->evict_inode() | Al Viro | 4 | -10/+16
pretty much brute-force... Signed-off-by: Al Viro <[email protected]>
2010-08-09 | convert logfs to ->evict_inode() | Al Viro | 3 | -36/+31
Signed-off-by: Al Viro <[email protected]>
2010-08-09 | logfs: get rid of magical inodes | Al Viro | 6 | -41/+31
ordering problems at ->kill_sb() time are solved by doing iput() of these suckers in ->put_super() Signed-off-by: Al Viro <[email protected]>