aboutsummaryrefslogtreecommitdiff
path: root/kernel/futex.c
AgeCommit message (Collapse)AuthorFilesLines
2015-05-19locking/arch: Rename set_mb() to smp_store_mb()Peter Zijlstra1-1/+1
Since set_mb() is really about an smp_mb() -- not a IO/DMA barrier like mb() rename it to match the recent smp_load_acquire() and smp_store_release(). Suggested-by: Linus Torvalds <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-05-08futex: Implement lockless wakeupsDavidlohr Bueso1-16/+17
Given the overall futex architecture, any chance of reducing hb->lock contention is welcome. In this particular case, using wake-queues to enable lockless wakeups addresses very much real world performance concerns, even cases of soft-lockups in cases of large amounts of blocked tasks (which is not hard to find in large boxes, using but just a handful of futex). At the lowest level, this patch can reduce latency of a single thread attempting to acquire hb->lock in highly contended scenarios by a up to 2x. At lower counts of nr_wake there are no regressions, confirming, of course, that the wake_q handling overhead is practically non existent. For instance, while a fair amount of variation, the extended pef-bench wakeup benchmark shows for a 20 core machine the following avg per-thread time to wakeup its share of tasks: nr_thr ms-before ms-after 16 0.0590 0.0215 32 0.0396 0.0220 48 0.0417 0.0182 64 0.0536 0.0236 80 0.0414 0.0097 96 0.0672 0.0152 Naturally, this can cause spurious wakeups. However there is no core code that cannot handle them afaict, and furthermore tglx does have the point that other events can already trigger them anyway. Signed-off-by: Davidlohr Bueso <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Chris Mason <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: George Spelvin <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Manfred Spraul <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-04-22futex: Remove bogus hrtimer_active() checkThomas Gleixner1-4/+1
The check for hrtimer_active() after starting the timer is pointless. If the timer is inactive it has expired already and therefor the task pointer is already NULL. Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Preeti U Murthy <[email protected]> Cc: Viresh Kumar <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Frederic Weisbecker <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2015-02-24Merge tag 'v4.0-rc1' into locking/core, to refresh the tree before merging ↵Ingo Molnar1-1/+1
new changes Signed-off-by: Ingo Molnar <[email protected]>
2015-02-18locking/futex: Check PF_KTHREAD rather than !p->mm to filter out kthreadsOleg Nesterov1-1/+1
attach_to_pi_owner() checks p->mm to prevent attaching to kthreads and this looks doubly wrong: 1. It should actually check PF_KTHREAD, kthread can do use_mm(). 2. If this task is not kthread and it is actually the lock owner we can wrongly return -EPERM instead of -ESRCH or retry-if-EAGAIN. And note that this wrong EPERM is the likely case unless the exiting task is (auto)reaped quickly, we check ->mm before PF_EXITING. Signed-off-by: Oleg Nesterov <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Darren Hart <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Jerome Marchand <[email protected]> Cc: Larry Woodman <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mateusz Guzik <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-02-12all arches, signal: move restart_block to struct task_structAndy Lutomirski1-1/+1
If an attacker can cause a controlled kernel stack overflow, overwriting the restart block is a very juicy exploit target. This is because the restart_block is held in the same memory allocation as the kernel stack. Moving the restart block to struct task_struct prevents this exploit by making the restart_block harder to locate. Note that there are other fields in thread_info that are also easy targets, at least on some architectures. It's also a decent simplification, since the restart code is more or less identical on all architectures. [[email protected]: metag: align thread_info::supervisor_stack] Signed-off-by: Andy Lutomirski <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Al Viro <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Kees Cook <[email protected]> Cc: David Miller <[email protected]> Acked-by: Richard Weinberger <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: Matt Turner <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Russell King <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: Haavard Skinnemoen <[email protected]> Cc: Hans-Christian Egtvedt <[email protected]> Cc: Steven Miao <[email protected]> Cc: Mark Salter <[email protected]> Cc: Aurelien Jacquiot <[email protected]> Cc: Mikael Starvik <[email protected]> Cc: Jesper Nilsson <[email protected]> Cc: David Howells <[email protected]> Cc: Richard Kuo <[email protected]> Cc: "Luck, Tony" <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Michal Simek <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Jonas Bonn <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Helge Deller <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Acked-by: Michael Ellerman <[email protected]> (powerpc) Tested-by: Michael Ellerman <[email protected]> (powerpc) Cc: Martin Schwidefsky <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Chen Liqin <[email protected]> Cc: Lennox Wu <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Guan Xuetao <[email protected]> Cc: Chris Zankel <[email protected]> Cc: Max Filippov <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Guenter Roeck <[email protected]> Signed-off-by: James Hogan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-01-19futex: Fix argument handling in futex_lock_pi() callsMichael Kerrisk1-3/+3
This patch fixes two separate buglets in calls to futex_lock_pi(): * Eliminate unused 'detect' argument * Change unused 'timeout' argument of FUTEX_TRYLOCK_PI to NULL The 'detect' argument of futex_lock_pi() seems never to have been used (when it was included with the initial PI mutex implementation in Linux 2.6.18, all checks against its value were disabled by ANDing against 0 (i.e., if (detect... && 0)), and with commit 778e9a9c3e7193ea9f434f382947155ffb59c755, any mention of this argument in futex_lock_pi() went way altogether. Its presence now serves only to confuse readers of the code, by giving the impression that the futex() FUTEX_LOCK_PI operation actually does use the 'val' argument. This patch removes the argument. The futex_lock_pi() call that corresponds to FUTEX_TRYLOCK_PI includes 'timeout' as one of its arguments. This misleads the reader into thinking that the FUTEX_TRYLOCK_PI operation does employ timeouts for some sensible purpose; but it does not. Indeed, it cannot, because the checks at the start of sys_futex() exclude FUTEX_TRYLOCK_PI from the set of operations that do copy_from_user() on the timeout argument. So, in the FUTEX_TRYLOCK_PI futex_lock_pi() call it would be simplest to change 'timeout' to 'NULL'. This patch does that. Signed-off-by: Michael Kerrisk <[email protected]> Reviewed-by: Darren Hart <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2014-10-26futex: Fix a race condition between REQUEUE_PI and task deathBrian Silverman1-11/+11
free_pi_state and exit_pi_state_list both clean up futex_pi_state's. exit_pi_state_list takes the hb lock first, and most callers of free_pi_state do too. requeue_pi doesn't, which means free_pi_state can free the pi_state out from under exit_pi_state_list. For example: task A | task B exit_pi_state_list | pi_state = | curr->pi_state_list->next | | futex_requeue(requeue_pi=1) | // pi_state is the same as | // the one in task A | free_pi_state(pi_state) | list_del_init(&pi_state->list) | kfree(pi_state) list_del_init(&pi_state->list) | Move the free_pi_state calls in requeue_pi to before it drops the hb locks which it's already holding. [ tglx: Removed a pointless free_pi_state() call and the hb->lock held debugging. The latter comes via a seperate patch ] Signed-off-by: Brian Silverman <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2014-10-26futex: Mention key referencing differences between shared and private futexesDavidlohr Bueso1-4/+10
Update our documentation as of fix 76835b0ebf8 (futex: Ensure get_futex_key_refs() always implies a barrier). Explicitly state that we don't do key referencing for private futexes. Signed-off-by: Davidlohr Bueso <[email protected]> Cc: Matteo Franchin <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Darren Hart <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Paul E. McKenney <[email protected]> Acked-by: Catalin Marinas <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2014-10-18futex: Ensure get_futex_key_refs() always implies a barrierCatalin Marinas1-0/+2
Commit b0c29f79ecea (futexes: Avoid taking the hb->lock if there's nothing to wake up) changes the futex code to avoid taking a lock when there are no waiters. This code has been subsequently fixed in commit 11d4616bd07f (futex: revert back to the explicit waiter counting code). Both the original commit and the fix-up rely on get_futex_key_refs() to always imply a barrier. However, for private futexes, none of the cases in the switch statement of get_futex_key_refs() would be hit and the function completes without a memory barrier as required before checking the "waiters" in futex_wake() -> hb_waiters_pending(). The consequence is a race with a thread waiting on a futex on another CPU, allowing the waker thread to read "waiters == 0" while the waiter thread to have read "futex_val == locked" (in kernel). Without this fix, the problem (user space deadlocks) can be seen with Android bionic's mutex implementation on an arm64 multi-cluster system. Signed-off-by: Catalin Marinas <[email protected]> Reported-by: Matteo Franchin <[email protected]> Fixes: b0c29f79ecea (futexes: Avoid taking the hb->lock if there's nothing to wake up) Acked-by: Davidlohr Bueso <[email protected]> Tested-by: Mike Galbraith <[email protected]> Cc: <[email protected]> Cc: Darren Hart <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Paul E. McKenney <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-09-12futex: Unlock hb->lock in futex_wait_requeue_pi() error pathThomas Gleixner1-0/+1
futex_wait_requeue_pi() calls futex_wait_setup(). If futex_wait_setup() succeeds it returns with hb->lock held and preemption disabled. Now the sanity check after this does: if (match_futex(&q.key, &key2)) { ret = -EINVAL; goto out_put_keys; } which releases the keys but does not release hb->lock. So we happily return to user space with hb->lock held and therefor preemption disabled. Unlock hb->lock before taking the exit route. Reported-by: Dave "Trinity" Jones <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Darren Hart <[email protected]> Reviewed-by: Davidlohr Bueso <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1409112318500.4178@nanos Signed-off-by: Thomas Gleixner <[email protected]>
2014-06-21futex: Simplify futex_lock_pi_atomic() and make it more robustThomas Gleixner1-87/+61
futex_lock_pi_atomic() is a maze of retry hoops and loops. Reduce it to simple and understandable states: First step is to lookup existing waiters (state) in the kernel. If there is an existing waiter, validate it and attach to it. If there is no existing waiter, check the user space value If the TID encoded in the user space value is 0, take over the futex preserving the owner died bit. If the TID encoded in the user space value is != 0, lookup the owner task, validate it and attach to it. Reduces text size by 128 bytes on x8664. Signed-off-by: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Kees Cook <[email protected]> Cc: [email protected] Cc: Darren Hart <[email protected]> Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1406131137020.5170@nanos Signed-off-by: Thomas Gleixner <[email protected]>
2014-06-21futex: Split out the first waiter attachment from lookup_pi_state()Thomas Gleixner1-14/+28
We want to be a bit more clever in futex_lock_pi_atomic() and separate the possible states. Split out the code which attaches the first waiter to the owner into a separate function. No functional change. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Darren Hart <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Kees Cook <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2014-06-21futex: Split out the waiter check from lookup_pi_state()Thomas Gleixner1-67/+71
We want to be a bit more clever in futex_lock_pi_atomic() and separate the possible states. Split out the waiter verification into a separate function. No functional change. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Darren Hart <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Kees Cook <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2014-06-21futex: Use futex_top_waiter() in lookup_pi_state()Thomas Gleixner1-63/+61
No point in open coding the same function again. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Darren Hart <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Kees Cook <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2014-06-21futex: Make unlock_pi more robustThomas Gleixner1-51/+25
The kernel tries to atomically unlock the futex without checking whether there is kernel state associated to the futex. So if user space manipulated the user space value, this will leave kernel internal state around associated to the owner task. For robustness sake, lookup first whether there are waiters on the futex. If there are waiters, wake the top priority waiter with all the proper sanity checks applied. If there are no waiters, do the atomic release. We do not have to preserve the waiters bit in this case, because a potentially incoming waiter is blocked on the hb->lock and will acquire the futex atomically. We neither have to preserve the owner died bit. The caller is the owner and it was supposed to cleanup the mess. Signed-off-by: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Darren Hart <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Kees Cook <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2014-06-21rtmutex: Confine deadlock logic to futexThomas Gleixner1-5/+5
The deadlock logic is only required for futexes. Remove the extra arguments for the public functions and also for the futex specific ones which get always called with deadlock detection enabled. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Steven Rostedt <[email protected]>
2014-06-08Merge branch 'next' (accumulated 3.16 merge window patches) into masterLinus Torvalds1-2/+2
Now that 3.15 is released, this merges the 'next' branch into 'master', bringing us to the normal situation where my 'master' branch is the merge window. * accumulated work in next: (6809 commits) ufs: sb mutex merge + mutex_destroy powerpc: update comments for generic idle conversion cris: update comments for generic idle conversion idle: remove cpu_idle() forward declarations nbd: zero from and len fields in NBD_CMD_DISCONNECT. mm: convert some level-less printks to pr_* MAINTAINERS: adi-buildroot-devel is moderated MAINTAINERS: add linux-api for review of API/ABI changes mm/kmemleak-test.c: use pr_fmt for logging fs/dlm/debug_fs.c: replace seq_printf by seq_puts fs/dlm/lockspace.c: convert simple_str to kstr fs/dlm/config.c: convert simple_str to kstr mm: mark remap_file_pages() syscall as deprecated mm: memcontrol: remove unnecessary memcg argument from soft limit functions mm: memcontrol: clean up memcg zoneinfo lookup mm/memblock.c: call kmemleak directly from memblock_(alloc|free) mm/mempool.c: update the kmemleak stack trace for mempool allocations lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations mm: introduce kmemleak_update_trace() mm/kmemleak.c: use %u to print ->checksum ...
2014-06-05futex: Make lookup_pi_state more robustThomas Gleixner1-28/+106
The current implementation of lookup_pi_state has ambigous handling of the TID value 0 in the user space futex. We can get into the kernel even if the TID value is 0, because either there is a stale waiters bit or the owner died bit is set or we are called from the requeue_pi path or from user space just for fun. The current code avoids an explicit sanity check for pid = 0 in case that kernel internal state (waiters) are found for the user space address. This can lead to state leakage and worse under some circumstances. Handle the cases explicit: Waiter | pi_state | pi->owner | uTID | uODIED | ? [1] NULL | --- | --- | 0 | 0/1 | Valid [2] NULL | --- | --- | >0 | 0/1 | Valid [3] Found | NULL | -- | Any | 0/1 | Invalid [4] Found | Found | NULL | 0 | 1 | Valid [5] Found | Found | NULL | >0 | 1 | Invalid [6] Found | Found | task | 0 | 1 | Valid [7] Found | Found | NULL | Any | 0 | Invalid [8] Found | Found | task | ==taskTID | 0/1 | Valid [9] Found | Found | task | 0 | 0 | Invalid [10] Found | Found | task | !=taskTID | 0/1 | Invalid [1] Indicates that the kernel can acquire the futex atomically. We came came here due to a stale FUTEX_WAITERS/FUTEX_OWNER_DIED bit. [2] Valid, if TID does not belong to a kernel thread. If no matching thread is found then it indicates that the owner TID has died. [3] Invalid. The waiter is queued on a non PI futex [4] Valid state after exit_robust_list(), which sets the user space value to FUTEX_WAITERS | FUTEX_OWNER_DIED. [5] The user space value got manipulated between exit_robust_list() and exit_pi_state_list() [6] Valid state after exit_pi_state_list() which sets the new owner in the pi_state but cannot access the user space value. [7] pi_state->owner can only be NULL when the OWNER_DIED bit is set. [8] Owner and user space value match [9] There is no transient state which sets the user space TID to 0 except exit_robust_list(), but this is indicated by the FUTEX_OWNER_DIED bit. See [4] [10] There is no transient state which leaves owner and user space TID out of sync. Signed-off-by: Thomas Gleixner <[email protected]> Cc: Kees Cook <[email protected]> Cc: Will Drewry <[email protected]> Cc: Darren Hart <[email protected]> Cc: [email protected] Signed-off-by: Linus Torvalds <[email protected]>
2014-06-05futex: Always cleanup owner tid in unlock_piThomas Gleixner1-22/+18
If the owner died bit is set at futex_unlock_pi, we currently do not cleanup the user space futex. So the owner TID of the current owner (the unlocker) persists. That's observable inconsistant state, especially when the ownership of the pi state got transferred. Clean it up unconditionally. Signed-off-by: Thomas Gleixner <[email protected]> Cc: Kees Cook <[email protected]> Cc: Will Drewry <[email protected]> Cc: Darren Hart <[email protected]> Cc: [email protected] Signed-off-by: Linus Torvalds <[email protected]>
2014-06-05futex: Validate atomic acquisition in futex_lock_pi_atomic()Thomas Gleixner1-3/+11
We need to protect the atomic acquisition in the kernel against rogue user space which sets the user space futex to 0, so the kernel side acquisition succeeds while there is existing state in the kernel associated to the real owner. Verify whether the futex has waiters associated with kernel state. If it has, return -EINVAL. The state is corrupted already, so no point in cleaning it up. Subsequent calls will fail as well. Not our problem. [ tglx: Use futex_top_waiter() and explain why we do not need to try restoring the already corrupted user space state. ] Signed-off-by: Darren Hart <[email protected]> Cc: Kees Cook <[email protected]> Cc: Will Drewry <[email protected]> Cc: [email protected] Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-05futex-prevent-requeue-pi-on-same-futex.patch futex: Forbid uaddr == uaddr2 ↵Thomas Gleixner1-0/+25
in futex_requeue(..., requeue_pi=1) If uaddr == uaddr2, then we have broken the rule of only requeueing from a non-pi futex to a pi futex with this call. If we attempt this, then dangling pointers may be left for rt_waiter resulting in an exploitable condition. This change brings futex_requeue() in line with futex_wait_requeue_pi() which performs the same check as per commit 6f7b0a2a5c0f ("futex: Forbid uaddr == uaddr2 in futex_wait_requeue_pi()") [ tglx: Compare the resulting keys as well, as uaddrs might be different depending on the mapping ] Fixes CVE-2014-3153. Reported-by: Pinkie Pie Signed-off-by: Will Drewry <[email protected]> Signed-off-by: Kees Cook <[email protected]> Cc: [email protected] Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Darren Hart <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-03Merge branch 'locking-core-for-linus' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next Pull core locking updates from Ingo Molnar: "The main changes in this cycle were: - reduced/streamlined smp_mb__*() interface that allows more usecases and makes the existing ones less buggy, especially in rarer architectures - add rwsem implementation comments - bump up lockdep limits" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits) rwsem: Add comments to explain the meaning of the rwsem's count field lockdep: Increase static allocations arch: Mass conversion of smp_mb__*() arch,doc: Convert smp_mb__*() arch,xtensa: Convert smp_mb__*() arch,x86: Convert smp_mb__*() arch,tile: Convert smp_mb__*() arch,sparc: Convert smp_mb__*() arch,sh: Convert smp_mb__*() arch,score: Convert smp_mb__*() arch,s390: Convert smp_mb__*() arch,powerpc: Convert smp_mb__*() arch,parisc: Convert smp_mb__*() arch,openrisc: Convert smp_mb__*() arch,mn10300: Convert smp_mb__*() arch,mips: Convert smp_mb__*() arch,metag: Convert smp_mb__*() arch,m68k: Convert smp_mb__*() arch,m32r: Convert smp_mb__*() arch,ia64: Convert smp_mb__*() ...
2014-05-19futex: Prevent attaching to kernel threadsThomas Gleixner1-0/+5
We happily allow userspace to declare a random kernel thread to be the owner of a user space PI futex. Found while analysing the fallout of Dave Jones syscall fuzzer. We also should validate the thread group for private futexes and find some fast way to validate whether the "alleged" owner has RW access on the file which backs the SHM, but that's a separate issue. Signed-off-by: Thomas Gleixner <[email protected]> Cc: Dave Jones <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Darren Hart <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Clark Williams <[email protected]> Cc: Paul McKenney <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Roland McGrath <[email protected]> Cc: Carlos ODonell <[email protected]> Cc: Jakub Jelinek <[email protected]> Cc: Michael Kerrisk <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected]
2014-05-19futex: Add another early deadlock detection checkThomas Gleixner1-13/+34
Dave Jones trinity syscall fuzzer exposed an issue in the deadlock detection code of rtmutex: http://lkml.kernel.org/r/[email protected] That underlying issue has been fixed with a patch to the rtmutex code, but the futex code must not call into rtmutex in that case because - it can detect that issue early - it avoids a different and more complex fixup for backing out If the user space variable got manipulated to 0x80000000 which means no lock holder, but the waiters bit set and an active pi_state in the kernel is found we can figure out the recursive locking issue by looking at the pi_state owner. If that is the current task, then we can safely return -EDEADLK. The check should have been added in commit 59fa62451 (futex: Handle futex_pi OWNER_DIED take over correctly) already, but I did not see the above issue caused by user space manipulation back then. Signed-off-by: Thomas Gleixner <[email protected]> Cc: Dave Jones <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Darren Hart <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Clark Williams <[email protected]> Cc: Paul McKenney <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Roland McGrath <[email protected]> Cc: Carlos ODonell <[email protected]> Cc: Jakub Jelinek <[email protected]> Cc: Michael Kerrisk <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected]
2014-04-18arch: Mass conversion of smp_mb__*()Peter Zijlstra1-2/+2
Mostly scripted conversion of the smp_mb__* barriers. Signed-off-by: Peter Zijlstra <[email protected]> Acked-by: Paul E. McKenney <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Cc: Linus Torvalds <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-04-12futex: update documentation for ordering guaranteesDavidlohr Bueso1-9/+23
Commits 11d4616bd07f ("futex: revert back to the explicit waiter counting code") and 69cd9eba3886 ("futex: avoid race between requeue and wake") changed some of the finer details of how we think about futexes. One was a late fix and the other a consequence of overlooking the whole requeuing logic. The first change caused our documentation to be incorrect, and the second made us aware that we need to explicitly add more details to it. Signed-off-by: Davidlohr Bueso <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-09futex: avoid race between requeue and wakeLinus Torvalds1-0/+5
Jan Stancek reported: "pthread_cond_broadcast/4-1.c testcase from openposix testsuite (LTP) occasionally fails, because some threads fail to wake up. Testcase creates 5 threads, which are all waiting on same condition. Main thread then calls pthread_cond_broadcast() without holding mutex, which calls: futex(uaddr1, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, uaddr2, ..) This immediately wakes up single thread A, which unlocks mutex and tries to wake up another thread: futex(uaddr2, FUTEX_WAKE_PRIVATE, 1) If thread A manages to call futex_wake() before any waiters are requeued for uaddr2, no other thread is woken up" The ordering constraints for the hash bucket waiter counting are that the waiter counts have to be incremented _before_ getting the spinlock (because the spinlock acts as part of the memory barrier), but the "requeue" operation didn't honor those rules, and nobody had even thought about that case. This fairly simple patch just increments the waiter count for the target hash bucket (hb2) when requeing a futex before taking the locks. It then decrements them again after releasing the lock - the code that actually moves the futex(es) between hash buckets will do the additional required waiter count housekeeping. Reported-and-tested-by: Jan Stancek <[email protected]> Acked-by: Davidlohr Bueso <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] # 3.14 Signed-off-by: Linus Torvalds <[email protected]>
2014-03-31Merge branch 'core-locking-for-linus' of ↵Linus Torvalds1-13/+24
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core locking updates from Ingo Molnar: "The biggest change is the MCS spinlock generalization changes from Tim Chen, Peter Zijlstra, Jason Low et al. There's also lockdep fixes/enhancements from Oleg Nesterov, in particular a false negative fix related to lockdep_set_novalidate_class() usage" * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits) locking/mutex: Fix debug checks locking/mutexes: Add extra reschedule point locking/mutexes: Introduce cancelable MCS lock for adaptive spinning locking/mutexes: Unlock the mutex without the wait_lock locking/mutexes: Modify the way optimistic spinners are queued locking/mutexes: Return false if task need_resched() in mutex_can_spin_on_owner() locking: Move mcs_spinlock.h into kernel/locking/ m68k: Skip futex_atomic_cmpxchg_inatomic() test futex: Allow architectures to skip futex_atomic_cmpxchg_inatomic() test Revert "sched/wait: Suppress Sparse 'variable shadowing' warning" lockdep: Change lockdep_set_novalidate_class() to use _and_name lockdep: Change mark_held_locks() to check hlock->check instead of lockdep_no_validate lockdep: Don't create the wrong dependency on hlock->check == 0 lockdep: Make held_lock->check and "int check" argument bool locking/mcs: Allow architecture specific asm files to be used for contended case locking/mcs: Order the header files in Kbuild of each architecture in alphabetical order sched/wait: Suppress Sparse 'variable shadowing' warning hung_task/Documentation: Fix hung_task_warnings description locking/mcs: Allow architectures to hook in to contended paths locking/mcs: Micro-optimize the MCS code, add extra comments ...
2014-03-20futex: revert back to the explicit waiter counting codeLinus Torvalds1-10/+43
Srikar Dronamraju reports that commit b0c29f79ecea ("futexes: Avoid taking the hb->lock if there's nothing to wake up") causes java threads getting stuck on futexes when runing specjbb on a power7 numa box. The cause appears to be that the powerpc spinlocks aren't using the same ticket lock model that we use on x86 (and other) architectures, which in turn result in the "spin_is_locked()" test in hb_waiters_pending() occasionally reporting an unlocked spinlock even when there are pending waiters. So this reinstates Davidlohr Bueso's original explicit waiter counting code, which I had convinced Davidlohr to drop in favor of figuring out the pending waiters by just using the existing state of the spinlock and the wait queue. Reported-and-tested-by: Srikar Dronamraju <[email protected]> Original-code-by: Davidlohr Bueso <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-03-03futex: Allow architectures to skip futex_atomic_cmpxchg_inatomic() testHeiko Carstens1-13/+24
If an architecture has futex_atomic_cmpxchg_inatomic() implemented and there is no runtime check necessary, allow to skip the test within futex_init(). This allows to get rid of some code which would always give the same result, and also allows the compiler to optimize a couple of if statements away. Signed-off-by: Heiko Carstens <[email protected]> Cc: Finn Thain <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Link: http://lkml.kernel.org/r/20140302120947.GA3641@osiris Signed-off-by: Thomas Gleixner <[email protected]>
2014-01-20Merge branch 'sched-core-for-linus' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler changes from Ingo Molnar: - Add the initial implementation of SCHED_DEADLINE support: a real-time scheduling policy where tasks that meet their deadlines and periodically execute their instances in less than their runtime quota see real-time scheduling and won't miss any of their deadlines. Tasks that go over their quota get delayed (Available to privileged users for now) - Clean up and fix preempt_enable_no_resched() abuse all around the tree - Do sched_clock() performance optimizations on x86 and elsewhere - Fix and improve auto-NUMA balancing - Fix and clean up the idle loop - Apply various cleanups and fixes * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits) sched: Fix __sched_setscheduler() nice test sched: Move SCHED_RESET_ON_FORK into attr::sched_flags sched: Fix up attr::sched_priority warning sched: Fix up scheduler syscall LTP fails sched: Preserve the nice level over sched_setscheduler() and sched_setparam() calls sched/core: Fix htmldocs warnings sched/deadline: No need to check p if dl_se is valid sched/deadline: Remove unused variables sched/deadline: Fix sparse static warnings m68k: Fix build warning in mac_via.h sched, thermal: Clean up preempt_enable_no_resched() abuse sched, net: Fixup busy_loop_us_clock() sched, net: Clean up preempt_enable_no_resched() abuse sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding sched/preempt, locking: Rework local_bh_{dis,en}able() sched/clock, x86: Avoid a runtime condition in native_sched_clock() sched/clock: Fix up clear_sched_clock_stable() sched/clock, x86: Use a static_key for sched_clock_stable sched/clock: Remove local_irq_disable() from the clocks sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs ...
2014-01-16futexes: Fix futex_hashsize initializationHeiko Carstens1-2/+4
"futexes: Increase hash table size for better performance" introduces a new alloc_large_system_hash() call. alloc_large_system_hash() however may allocate less memory than requested, e.g. limited by MAX_ORDER. Hence pass a pointer to alloc_large_system_hash() which will contain the hash shift when the function returns. Afterwards correctly set futex_hashsize. Fixes a crash on s390 where the requested allocation size was 4MB but only 1MB was allocated. Signed-off-by: Heiko Carstens <[email protected]> Cc: Darren Hart <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Waiman Long <[email protected]> Cc: Jason Low <[email protected]> Cc: Davidlohr Bueso <[email protected]> Link: http://lkml.kernel.org/r/20140116135450.GA4345@osiris Signed-off-by: Ingo Molnar <[email protected]>
2014-01-13rtmutex: Turn the plist into an rb-treePeter Zijlstra1-0/+2
Turn the pi-chains from plist to rb-tree, in the rt_mutex code, and provide a proper comparison function for -deadline and -priority tasks. This is done mainly because: - classical prio field of the plist is just an int, which might not be enough for representing a deadline; - manipulating such a list would become O(nr_deadline_tasks), which might be to much, as the number of -deadline task increases. Therefore, an rb-tree is used, and tasks are queued in it according to the following logic: - among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the one with the higher (lower, actually!) prio wins; - among a -priority and a -deadline task, the latter always wins; - among two -deadline tasks, the one with the earliest deadline wins. Queueing and dequeueing functions are changed accordingly, for both the list of a task's pi-waiters and the list of tasks blocked on a pi-lock. Signed-off-by: Peter Zijlstra <[email protected]> Signed-off-by: Dario Faggioli <[email protected]> Signed-off-by: Juri Lelli <[email protected]> Signed-off-again-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-01-13futexes: Avoid taking the hb->lock if there's nothing to wake upDavidlohr Bueso1-25/+92
In futex_wake() there is clearly no point in taking the hb->lock if we know beforehand that there are no tasks to be woken. While the hash bucket's plist head is a cheap way of knowing this, we cannot rely 100% on it as there is a racy window between the futex_wait call and when the task is actually added to the plist. To this end, we couple it with the spinlock check as tasks trying to enter the critical region are most likely potential waiters that will be added to the plist, thus preventing tasks sleeping forever if wakers don't acknowledge all possible waiters. Furthermore, the futex ordering guarantees are preserved, ensuring that waiters either observe the changed user space value before blocking or is woken by a concurrent waker. For wakers, this is done by relying on the barriers in get_futex_key_refs() -- for archs that do not have implicit mb in atomic_inc(), we explicitly add them through a new futex_get_mm function. For waiters we rely on the fact that spin_lock calls already update the head counter, so spinners are visible even if the lock hasn't been acquired yet. For more details please refer to the updated comments in the code and related discussion: https://lkml.org/lkml/2013/11/26/556 Special thanks to tglx for careful review and feedback. Suggested-by: Linus Torvalds <[email protected]> Reviewed-by: Darren Hart <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Reviewed-by: Peter Zijlstra <[email protected]> Signed-off-by: Davidlohr Bueso <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Jeff Mahoney <[email protected]> Cc: Scott Norton <[email protected]> Cc: Tom Vaden <[email protected]> Cc: Aswin Chandramouleeswaran <[email protected]> Cc: Waiman Long <[email protected]> Cc: Jason Low <[email protected]> Cc: Andrew Morton <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-01-13futexes: Document multiprocessor ordering guaranteesThomas Gleixner1-0/+57
That's essential, if you want to hack on futexes. Reviewed-by: Darren Hart <[email protected]> Reviewed-by: Peter Zijlstra <[email protected]> Reviewed-by: Paul E. McKenney <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Davidlohr Bueso <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Jeff Mahoney <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Scott Norton <[email protected]> Cc: Tom Vaden <[email protected]> Cc: Aswin Chandramouleeswaran <[email protected]> Cc: Waiman Long <[email protected]> Cc: Jason Low <[email protected]> Cc: Andrew Morton <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-01-13futexes: Increase hash table size for better performanceDavidlohr Bueso1-7/+19
Currently, the futex global hash table suffers from its fixed, smallish (for today's standards) size of 256 entries, as well as its lack of NUMA awareness. Large systems, using many futexes, can be prone to high amounts of collisions; where these futexes hash to the same bucket and lead to extra contention on the same hb->lock. Furthermore, cacheline bouncing is a reality when we have multiple hb->locks residing on the same cacheline and different futexes hash to adjacent buckets. This patch keeps the current static size of 16 entries for small systems, or otherwise, 256 * ncpus (or larger as we need to round the number to a power of 2). Note that this number of CPUs accounts for all CPUs that can ever be available in the system, taking into consideration things like hotpluging. While we do impose extra overhead at bootup by making the hash table larger, this is a one time thing, and does not shadow the benefits of this patch. Furthermore, as suggested by tglx, by cache aligning the hash buckets we can avoid access across cacheline boundaries and also avoid massive cache line bouncing if multiple cpus are hammering away at different hash buckets which happen to reside in the same cache line. Also, similar to other core kernel components (pid, dcache, tcp), by using alloc_large_system_hash() we benefit from its NUMA awareness and thus the table is distributed among the nodes instead of in a single one. For a custom microbenchmark that pounds on the uaddr hashing -- making the wait path fail at futex_wait_setup() returning -EWOULDBLOCK for large amounts of futexes, we can see the following benefits on a 80-core, 8-socket 1Tb server: +---------+--------------------+------------------------+-----------------------+-------------------------------+ | threads | baseline (ops/sec) | aligned-only (ops/sec) | large table (ops/sec) | large table+aligned (ops/sec) | +---------+--------------------+------------------------+-----------------------+-------------------------------+ |     512 |              32426 | 50531  (+55.8%)        | 255274  (+687.2%)     | 292553  (+802.2%)             | |     256 |              65360 | 99588  (+52.3%)        | 443563  (+578.6%)     | 508088  (+677.3%)             | |     128 |             125635 | 200075 (+59.2%)        | 742613  (+491.1%)     | 835452  (+564.9%)             | |      80 |             193559 | 323425 (+67.1%)        | 1028147 (+431.1%)     | 1130304 (+483.9%)             | |      64 |             247667 | 443740 (+79.1%)        | 997300  (+302.6%)     | 1145494 (+362.5%)             | |      32 |             628412 | 721401 (+14.7%)        | 965996  (+53.7%)      | 1122115 (+78.5%)              | +---------+--------------------+------------------------+-----------------------+-------------------------------+ Reviewed-by: Darren Hart <[email protected]> Reviewed-by: Peter Zijlstra <[email protected]> Reviewed-by: Paul E. McKenney <[email protected]> Reviewed-by: Waiman Long <[email protected]> Reviewed-and-tested-by: Jason Low <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Signed-off-by: Davidlohr Bueso <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Jeff Mahoney <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Scott Norton <[email protected]> Cc: Tom Vaden <[email protected]> Cc: Aswin Chandramouleeswaran <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-01-13futexes: Clean up various detailsJason Low1-27/+12
- Remove unnecessary head variables. - Delete unused parameter in queue_unlock(). Reviewed-by: Darren Hart <[email protected]> Reviewed-by: Peter Zijlstra <[email protected]> Reviewed-by: Paul E. McKenney <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Signed-off-by: Jason Low <[email protected]> Signed-off-by: Davidlohr Bueso <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Jeff Mahoney <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Scott Norton <[email protected]> Cc: Tom Vaden <[email protected]> Cc: Aswin Chandramouleeswaran <[email protected]> Cc: Waiman Long <[email protected]> Cc: Andrew Morton <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2013-12-12futex: move user address verification up to common codeLinus Torvalds1-2/+3
When debugging the read-only hugepage case, I was confused by the fact that get_futex_key() did an access_ok() only for the non-shared futex case, since the user address checking really isn't in any way specific to the private key handling. Now, it turns out that the shared key handling does effectively do the equivalent checks inside get_user_pages_fast() (it doesn't actually check the address range on x86, but does check the page protections for being a user page). So it wasn't actually a bug, but the fact that we treat the address differently for private and shared futexes threw me for a loop. Just move the check up, so that it gets done for both cases. Also, use the 'rw' parameter for the type, even if it doesn't actually matter any more (it's a historical artifact of the old racy i386 "page faults from kernel space don't check write protections"). Cc: Thomas Gleixner <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-12-12futex: fix handling of read-only-mapped hugepagesLinus Torvalds1-1/+1
The hugepage code had the exact same bug that regular pages had in commit 7485d0d3758e ("futexes: Remove rw parameter from get_futex_key()"). The regular page case was fixed by commit 9ea71503a8ed ("futex: Fix regression with read only mappings"), but the transparent hugepage case (added in a5b338f2b0b1: "thp: update futex compound knowledge") case remained broken. Found by Dave Jones and his trinity tool. Reported-and-tested-by: Dave Jones <[email protected]> Cc: [email protected] # v2.6.38+ Acked-by: Thomas Gleixner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Darren Hart <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Oleg Nesterov <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-11-06locking: Move the rtmutex code to kernel/locking/Peter Zijlstra1-1/+1
Suggested-by: Ingo Molnar <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/n/[email protected] [ Fixed build failure in kernel/rcu/tree_plugin.h. ] Signed-off-by: Ingo Molnar <[email protected]>
2013-06-25futex: Use freezable blocking callColin Cross1-1/+2
Avoid waking up every thread sleeping in a futex_wait call during suspend and resume by calling a freezable blocking call. Previous patches modified the freezer to avoid sending wakeups to threads that are blocked in freezable blocking calls. This call was selected to be converted to a freezable call because it doesn't hold any locks or release any resources when interrupted that might be needed by another freezing task or a kernel driver during suspend, and is a common site where idle userspace tasks are blocked. Signed-off-by: Colin Cross <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: [email protected] Cc: Tejun Heo <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Darren Hart <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Al Viro <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2013-06-25futex: Take hugepages into account when generating futex_keyZhang Yi1-1/+2
The futex_keys of process shared futexes are generated from the page offset, the mapping host and the mapping index of the futex user space address. This should result in an unique identifier for each futex. Though this is not true when futexes are located in different subpages of an hugepage. The reason is, that the mapping index for all those futexes evaluates to the index of the base page of the hugetlbfs mapping. So a futex at offset 0 of the hugepage mapping and another one at offset PAGE_SIZE of the same hugepage mapping have identical futex_keys. This happens because the futex code blindly uses page->index. Steps to reproduce the bug: 1. Map a file from hugetlbfs. Initialize pthread_mutex1 at offset 0 and pthread_mutex2 at offset PAGE_SIZE of the hugetlbfs mapping. The mutexes must be initialized as PTHREAD_PROCESS_SHARED because PTHREAD_PROCESS_PRIVATE mutexes are not affected by this issue as their keys solely depend on the user space address. 2. Lock mutex1 and mutex2 3. Create thread1 and in the thread function lock mutex1, which results in thread1 blocking on the locked mutex1. 4. Create thread2 and in the thread function lock mutex2, which results in thread2 blocking on the locked mutex2. 5. Unlock mutex2. Despite the fact that mutex2 got unlocked, thread2 still blocks on mutex2 because the futex_key points to mutex1. To solve this issue we need to take the normal page index of the page which contains the futex into account, if the futex is in an hugetlbfs mapping. In other words, we calculate the normal page mapping index of the subpage in the hugetlbfs mapping. Mappings which are not based on hugetlbfs are not affected and still use page->index. Thanks to Mel Gorman who provided a patch for adding proper evaluation functions to the hugetlbfs code to avoid exposing hugetlbfs specific details to the futex code. [ tglx: Massaged changelog ] Signed-off-by: Zhang Yi <[email protected]> Reviewed-by: Jiang Biao <[email protected]> Tested-by: Ma Chenggong <[email protected]> Reviewed-by: 'Mel Gorman' <[email protected]> Acked-by: 'Darren Hart' <[email protected]> Cc: 'Peter Zijlstra' <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/000101ce71a6%24a83c5880%24f8b50980%24@com Signed-off-by: Thomas Gleixner <[email protected]>
2013-03-12futex: fix kernel-doc notation and spelloRandy Dunlap1-23/+23
Fix kernel-doc warning in futex.c and convert 'Returns' to the new Return: kernel-doc notation format. Warning(kernel/futex.c:2286): Excess function parameter 'clockrt' description in 'futex_wait_requeue_pi' Fix one spello. Signed-off-by: Randy Dunlap <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-02-27more file_inode() open-coded instancesAl Viro1-1/+1
Signed-off-by: Al Viro <[email protected]>
2013-02-22Merge branch 'core-locking-for-linus' of ↵Linus Torvalds1-2/+0
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core locking changes from Ingo Molnar: "The biggest change is the rwsem lock-steal improvements, both to the assembly optimized and the spinlock based variants. The other notable change is the clean up of the seqlock implementation to be based on the seqcount infrastructure. The rest is assorted smaller debuggability, cleanup and continued -rt locking changes." * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rwsem-spinlock: Implement writer lock-stealing for better scalability futex: Revert "futex: Mark get_robust_list as deprecated" generic: Use raw local irq variant for generic cmpxchg lockdep: Selftest: convert spinlock to raw spinlock seqlock: Use seqcount infrastructure seqlock: Remove unused functions ntp: Make ntp_lock raw intel_idle: Convert i7300_idle_lock to raw_spinlock locking: Various static lock initializer fixes lockdep: Print more info when MAX_LOCK_DEPTH is exceeded rwsem: Implement writer lock-stealing for better scalability lockdep: Silence warning if CONFIG_LOCKDEP isn't set watchdog: Use local_clock for get_timestamp() lockdep: Rename print_unlock_inbalance_bug() to print_unlock_imbalance_bug() locking/stat: Fix a typo
2013-02-19futex: Revert "futex: Mark get_robust_list as deprecated"Thomas Gleixner1-2/+0
This reverts commit ec0c4274e33c0373e476b73e01995c53128f1257. get_robust_list() is in use and a removal would break existing user space. With the permission checks in place it's not longer a security hole. Remove the deprecation warnings. Signed-off-by: Thomas Gleixner <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected]
2013-02-07sched/rt: Move rt specific bits into new header fileClark Williams1-0/+1
Move rt scheduler definitions out of include/linux/sched.h into new file include/linux/sched/rt.h Signed-off-by: Clark Williams <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2012-11-26futex: avoid wake_futex() for a PI futex_qDarren Hart1-1/+17
Dave Jones reported a bug with futex_lock_pi() that his trinity test exposed. Sometime between queue_me() and taking the q.lock_ptr, the lock_ptr became NULL, resulting in a crash. While futex_wake() is careful to not call wake_futex() on futex_q's with a pi_state or an rt_waiter (which are either waiting for a futex_unlock_pi() or a PI futex_requeue()), futex_wake_op() and futex_requeue() do not perform the same test. Update futex_wake_op() and futex_requeue() to test for q.pi_state and q.rt_waiter and abort with -EINVAL if detected. To ensure any future breakage is caught, add a WARN() to wake_futex() if the same condition is true. This fix has seen 3 hours of testing with "trinity -c futex" on an x86_64 VM with 4 CPUS. [[email protected]: tidy up the WARN()] Signed-off-by: Darren Hart <[email protected]> Reported-by: Dave Jones <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: John Kacur <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-11-01futex: Handle futex_pi OWNER_DIED take over correctlyThomas Gleixner1-19/+22
Siddhesh analyzed a failure in the take over of pi futexes in case the owner died and provided a workaround. See: http://sourceware.org/bugzilla/show_bug.cgi?id=14076 The detailed problem analysis shows: Futex F is initialized with PTHREAD_PRIO_INHERIT and PTHREAD_MUTEX_ROBUST_NP attributes. T1 lock_futex_pi(F); T2 lock_futex_pi(F); --> T2 blocks on the futex and creates pi_state which is associated to T1. T1 exits --> exit_robust_list() runs --> Futex F userspace value TID field is set to 0 and FUTEX_OWNER_DIED bit is set. T3 lock_futex_pi(F); --> Succeeds due to the check for F's userspace TID field == 0 --> Claims ownership of the futex and sets its own TID into the userspace TID field of futex F --> returns to user space T1 --> exit_pi_state_list() --> Transfers pi_state to waiter T2 and wakes T2 via rt_mutex_unlock(&pi_state->mutex) T2 --> acquires pi_state->mutex and gains real ownership of the pi_state --> Claims ownership of the futex and sets its own TID into the userspace TID field of futex F --> returns to user space T3 --> observes inconsistent state This problem is independent of UP/SMP, preemptible/non preemptible kernels, or process shared vs. private. The only difference is that certain configurations are more likely to expose it. So as Siddhesh correctly analyzed the following check in futex_lock_pi_atomic() is the culprit: if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) { We check the userspace value for a TID value of 0 and take over the futex unconditionally if that's true. AFAICT this check is there as it is correct for a different corner case of futexes: the WAITERS bit became stale. Now the proposed change - if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) { + if (unlikely(ownerdied || + !(curval & (FUTEX_TID_MASK | FUTEX_WAITERS)))) { solves the problem, but it's not obvious why and it wreckages the "stale WAITERS bit" case. What happens is, that due to the WAITERS bit being set (T2 is blocked on that futex) it enforces T3 to go through lookup_pi_state(), which in the above case returns an existing pi_state and therefor forces T3 to legitimately fight with T2 over the ownership of the pi_state (via pi_state->mutex). Probelm solved! Though that does not work for the "WAITERS bit is stale" problem because if lookup_pi_state() does not find existing pi_state it returns -ERSCH (due to TID == 0) which causes futex_lock_pi() to return -ESRCH to user space because the OWNER_DIED bit is not set. Now there is a different solution to that problem. Do not look at the user space value at all and enforce a lookup of possibly available pi_state. If pi_state can be found, then the new incoming locker T3 blocks on that pi_state and legitimately races with T2 to acquire the rt_mutex and the pi_state and therefor the proper ownership of the user space futex. lookup_pi_state() has the correct order of checks. It first tries to find a pi_state associated with the user space futex and only if that fails it checks for futex TID value = 0. If no pi_state is available nothing can create new state at that point because this happens with the hash bucket lock held. So the above scenario changes to: T1 lock_futex_pi(F); T2 lock_futex_pi(F); --> T2 blocks on the futex and creates pi_state which is associated to T1. T1 exits --> exit_robust_list() runs --> Futex F userspace value TID field is set to 0 and FUTEX_OWNER_DIED bit is set. T3 lock_futex_pi(F); --> Finds pi_state and blocks on pi_state->rt_mutex T1 --> exit_pi_state_list() --> Transfers pi_state to waiter T2 and wakes it via rt_mutex_unlock(&pi_state->mutex) T2 --> acquires pi_state->mutex and gains ownership of the pi_state --> Claims ownership of the futex and sets its own TID into the userspace TID field of futex F --> returns to user space This covers all gazillion points on which T3 might come in between T1's exit_robust_list() clearing the TID field and T2 fixing it up. It also solves the "WAITERS bit stale" problem by forcing the take over. Another benefit of changing the code this way is that it makes it less dependent on untrusted user space values and therefor minimizes the possible wreckage which might be inflicted. As usual after staring for too long at the futex code my brain hurts so much that I really want to ditch that whole optimization of avoiding the syscall for the non contended case for PI futexes and rip out the maze of corner case handling code. Unfortunately we can't as user space relies on that existing behaviour, but at least thinking about it helps me to preserve my mental sanity. Maybe we should nevertheless :) Reported-and-tested-by: Siddhesh Poyarekar <[email protected]> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1210232138540.2756@ionos Acked-by: Darren Hart <[email protected]> Cc: [email protected] Signed-off-by: Thomas Gleixner <[email protected]>