path: root/kernel
Age  Commit message  Author  Files  Lines (-/+)
2015-07-17  rcu: Add stall warnings to synchronize_sched_expedited()  [Paul E. McKenney]  2 files, -4/+55
Although synchronize_sched_expedited() historically has no RCU CPU stall warnings, the availability of the rcupdate.rcu_expedited boot parameter invalidates the old assumption that synchronize_sched()'s stall warnings would suffice. This commit therefore adds RCU CPU stall warnings to synchronize_sched_expedited(). Signed-off-by: Paul E. McKenney <[email protected]>
2015-07-17  rcu: Extend expedited funnel locking to rcu_data structure  [Paul E. McKenney]  3 files, -5/+21
The strictly rcu_node based funnel-locking scheme works well in many cases, but systems with CONFIG_RCU_FANOUT_LEAF=64 won't necessarily get all that much concurrency. This commit therefore extends the funnel locking into the per-CPU rcu_data structure, providing concurrency equal to the number of CPUs. Signed-off-by: Paul E. McKenney <[email protected]>
2015-07-17  rcu: Consolidate last open-coded expedited memory barrier  [Paul E. McKenney]  2 files, -2/+1
One of the requirements on RCU grace periods is that if there is a causal chain of operations that starts after one grace period and ends before another grace period, then the two grace periods must be serialized. There has been (and might still be) code that relies on this, for example, certain types of reference-counting code that does a call_rcu() within an RCU callback function. This requirement is why there is an smp_mb() at the end of both synchronize_sched_expedited() and synchronize_rcu_expedited(). However, this is the only smp_mb() in these functions, so it would be nicer to consolidate it into rcu_exp_gp_seq_end(). This commit does just that. Signed-off-by: Paul E. McKenney <[email protected]>
2015-07-17  rcu: Apply rcu_seq operations to _rcu_barrier()  [Paul E. McKenney]  3 files, -56/+22
The rcu_seq operations were open-coded in _rcu_barrier(), so this commit replaces the open-coding with the shiny new rcu_seq operations. Signed-off-by: Paul E. McKenney <[email protected]>
2015-07-17  rcu: Use funnel locking for synchronize_rcu_expedited()'s polling loop  [Paul E. McKenney]  2 files, -33/+18
This commit gets rid of synchronize_rcu_expedited()'s mutex_trylock() polling loop in favor of the funnel-locking scheme that was abstracted from synchronize_sched_expedited(). Signed-off-by: Paul E. McKenney <[email protected]>
2015-07-17  rcu: Fix synchronize_sched_expedited() type error for "s"  [Paul E. McKenney]  1 file, -1/+1
The type of "s" has been "long" rather than the correct "unsigned long" for quite some time. This commit fixes this type error. Signed-off-by: Paul E. McKenney <[email protected]>
2015-07-17  rcu: Abstract funnel locking from synchronize_sched_expedited()  [Paul E. McKenney]  1 file, -33/+47
This commit abstracts funnel locking from synchronize_sched_expedited() so that it may be used by synchronize_rcu_expedited(). Signed-off-by: Paul E. McKenney <[email protected]>
2015-07-17  rcu: Make synchronize_rcu_expedited() use sequence-counter scheme  [Paul E. McKenney]  1 file, -10/+6
Although synchronize_rcu_expedited() uses a sequence-counter scheme, it is based on a single increment per grace period, which means that tasks piggybacking off of concurrent grace periods may be forced to wait longer than necessary. This commit therefore applies the new sequence-count functions developed for synchronize_sched_expedited() to speed things up a bit and to consolidate the sequence-counter implementation. Signed-off-by: Paul E. McKenney <[email protected]>
2015-07-17  rcu: Abstract sequence counting from synchronize_sched_expedited()  [Paul E. McKenney]  1 file, -10/+58
This commit creates rcu_exp_gp_seq_start() and rcu_exp_gp_seq_end() to bracket an expedited grace period, rcu_exp_gp_seq_snap() to snapshot the sequence counter, and rcu_exp_gp_seq_done() to check whether a full expedited grace period has elapsed since the snapshot. These will be applied to synchronize_rcu_expedited(). They are defined in terms of the underlying rcu_seq_start(), rcu_seq_end(), rcu_seq_snap(), and rcu_seq_done(), which will be applied to _rcu_barrier(). One reason that this commit does not use the seqcount primitives themselves is that the smp_wmb() in those primitives is insufficient, because expedited grace periods do reads as well as writes. In addition, the read-side seqcount primitives detect a potentially partial change, whereas the expedited primitives need a guaranteed full change.
Signed-off-by: Paul E. McKenney <[email protected]>
2015-07-17  rcu: Make expedited GP CPU stoppage asynchronous  [Peter Zijlstra]  3 files, -15/+25
Sequentially stopping the CPUs slows down expedited grace periods by at least a factor of two, based on rcutorture's grace-period-per-second rate. This is a conservative measure because rcutorture uses unusually long RCU read-side critical sections and because rcutorture periodically quiesces the system in order to test RCU's ability to ramp down to and up from the idle state. This commit therefore replaces the stop_one_cpu() with stop_one_cpu_nowait(), using an atomic-counter scheme to determine when all CPUs have passed through the stopped state. Signed-off-by: Peter Zijlstra <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
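The atomic-counter scheme can be sketched roughly as follows. This is an illustration only: the names (exp_stop_count, rcu_exp_stop_cpu, and so on) are made up for the example rather than being the actual tree.c symbols, and CPU-hotplug races that the real code must handle are ignored.

    #include <linux/atomic.h>
    #include <linux/cpu.h>
    #include <linux/percpu.h>
    #include <linux/stop_machine.h>
    #include <linux/wait.h>

    static atomic_t exp_stop_count;
    static DECLARE_WAIT_QUEUE_HEAD(exp_stop_wq);
    static DEFINE_PER_CPU(struct cpu_stop_work, exp_stop_work);

    /* Runs in stopper context on each CPU; the last CPU through wakes the waiter. */
    static int rcu_exp_stop_cpu(void *unused)
    {
            if (atomic_dec_and_test(&exp_stop_count))
                    wake_up(&exp_stop_wq);
            return 0;
    }

    static void rcu_exp_stop_all_cpus(void)
    {
            int cpu;

            atomic_set(&exp_stop_count, num_online_cpus());
            for_each_online_cpu(cpu)
                    stop_one_cpu_nowait(cpu, rcu_exp_stop_cpu, NULL,
                                        &per_cpu(exp_stop_work, cpu));
            /* All stops were queued without waiting; now wait for them to complete. */
            wait_event(exp_stop_wq, !atomic_read(&exp_stop_count));
    }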
2015-07-17  rcu: Get rid of synchronize_sched_expedited()'s polling loop  [Paul E. McKenney]  3 files, -59/+47
This commit gets rid of synchronize_sched_expedited()'s mutex_trylock() polling loop in favor of a funnel-locking scheme based on the rcu_node tree. The work-done check is done at each level of the tree, allowing high-contention situations to be resolved quickly with reasonable levels of mutex contention. Signed-off-by: Paul E. McKenney <[email protected]>
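The funnel-locking idea itself can be illustrated independently of the rcu_node internals. In this hedged sketch the node type and the gp_done() work-done predicate are hypothetical stand-ins, not the kernel's data structures.

    #include <linux/mutex.h>

    struct funnel_node {
            struct mutex lock;
            struct funnel_node *parent;     /* NULL at the root */
    };

    static bool gp_done(unsigned long s);   /* hypothetical work-done check */

    /* Walk from a leaf toward the root, holding one mutex at a time.  Returns
     * the root with its mutex held, or NULL (no mutex held) if some other
     * task's grace period already covered us. */
    static struct funnel_node *funnel_lock(struct funnel_node *leaf, unsigned long s)
    {
            struct funnel_node *np = leaf;

            mutex_lock(&np->lock);
            while (np->parent) {
                    if (gp_done(s)) {
                            mutex_unlock(&np->lock);
                            return NULL;
                    }
                    mutex_lock(&np->parent->lock);  /* acquire the parent ... */
                    mutex_unlock(&np->lock);        /* ... then drop the child */
                    np = np->parent;
            }
            return np;
    }

Tasks that lose a race block at a lower level of the tree and usually find the work already done by the time they advance, which is what keeps mutex contention reasonable under heavy load.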
2015-07-17  rcu: Rework synchronize_sched_expedited() counter handling  [Paul E. McKenney]  3 files, -83/+36
Now that synchronize_sched_expedited() has a mutex, it can use a simpler work-already-done detection scheme. This commit implements that scheme using something similar to the sequence-locking counter approach. A counter is incremented before and after each grace period, so that the counter is odd in the midst of the grace period and even otherwise. So if the counter has advanced to the second even number that is greater than or equal to the snapshot, the required grace period has already happened.
Signed-off-by: Paul E. McKenney <[email protected]>
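The odd/even counter arithmetic can be checked with a small userspace sketch. This illustrates the scheme described above rather than the kernel code, and it omits the memory barriers and wraparound-safe comparisons the kernel needs.

    #include <assert.h>
    #include <stdbool.h>
    #include <stdio.h>

    static unsigned long gp_seq;    /* odd while a grace period is in progress */

    static void gp_start(void) { gp_seq++; assert(gp_seq & 1); }
    static void gp_end(void)   { gp_seq++; assert(!(gp_seq & 1)); }

    /* Value the counter must reach for a full GP to have elapsed since the
     * snapshot: the second even number greater than or equal to the snapshot. */
    static unsigned long gp_snap(void) { return (gp_seq + 3) & ~0x1UL; }

    static bool gp_done(unsigned long snap) { return gp_seq >= snap; }

    int main(void)
    {
            unsigned long s;

            s = gp_snap();          /* counter even: one full GP is needed */
            gp_start();
            assert(!gp_done(s));    /* a GP still in progress does not count */
            gp_end();
            assert(gp_done(s));     /* one complete GP satisfies us */

            gp_start();
            s = gp_snap();          /* counter odd: in-flight GP cannot satisfy us */
            gp_end();
            assert(!gp_done(s));
            gp_start();
            gp_end();
            assert(gp_done(s));     /* the *next* complete GP is required */

            puts("odd/even GP counter behaves as described");
            return 0;
    }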
2015-07-17  rcu: Switch synchronize_sched_expedited() to stop_one_cpu()  [Peter Zijlstra]  2 files, -27/+15
The synchronize_sched_expedited() currently invokes try_stop_cpus(), which schedules the stopper kthreads on each online non-idle CPU, and waits until all those kthreads are running before letting any of them stop. This is disastrous for real-time workloads, which get hit with a preemption that is as long as the longest scheduling latency on any CPU, including any non-realtime housekeeping CPUs. This commit therefore switches to using stop_one_cpu() on each CPU in turn. This avoids inflicting the worst-case scheduling latency on the worst-case CPU onto all other CPUs, and also simplifies the code a little bit. Follow-up commits will simplify the counter-snapshotting algorithm and convert a number of the counters that are now protected by the new ->expedited_mutex to non-atomic. Signed-off-by: Peter Zijlstra <[email protected]> [ paulmck: Kept stop_one_cpu(), dropped disabling of "guardrails". ] Signed-off-by: Paul E. McKenney <[email protected]>
2015-07-17  rcu: Remove CONFIG_RCU_CPU_STALL_INFO  [Paul E. McKenney]  2 files, -49/+0
The CONFIG_RCU_CPU_STALL_INFO option has been default-y for a couple of releases with no complaints, so it is time to eliminate this Kconfig option entirely, so that the long-form RCU CPU stall warnings cannot be disabled. This commit does just that.
Signed-off-by: Paul E. McKenney <[email protected]>
2015-07-17  rcu: Stop disabling CPU hotplug in synchronize_rcu_expedited()  [Paul E. McKenney]  1 file, -23/+2
The fact that tasks could be migrated from leaf to root rcu_node structures meant that synchronize_rcu_expedited() had to disable CPU hotplug. However, tasks now stay put, so this commit removes the CPU-hotplug disabling from synchronize_rcu_expedited(). Signed-off-by: Paul E. McKenney <[email protected]>
2015-07-17  rcu: Reset rcu_fanout_leaf if out of bounds  [Paul E. McKenney]  1 file, -0/+1
Currently if the rcu_fanout_leaf boot parameter is out of bounds (that is, less than RCU_FANOUT_LEAF or greater than the number of bits in an unsigned long), a warning is issued and execution continues with the out-of-bounds value. This can result in all manner of failures, so this patch resets rcu_fanout_leaf to RCU_FANOUT_LEAF when an out-of-bounds condition is detected. Signed-off-by: Paul E. McKenney <[email protected]>
2015-07-17  rcu: Shut up bogus gcc array bounds warning  [Alexander Gordeev]  1 file, -1/+3
Because gcc does not realize that the loop below is never entered when rcu_num_lvls == 1:

    for (i = 1; i < rcu_num_lvls; i++)
            rsp->level[i] = rsp->level[i - 1] + levelcnt[i - 1];

some compiler versions (pre-5.x?) give a bogus warning:

    kernel/rcu/tree.c: In function ‘rcu_init_one.isra.55’:
    kernel/rcu/tree.c:4108:13: warning: array subscript is above array bounds [-Warray-bounds]
       rsp->level[i] = rsp->level[i - 1] + rsp->levelcnt[i - 1];
                   ^

Fix that warning by adding an extra item to the rcu_state::level[] array. Once the bogus warning is fixed in gcc and the kernel drops support for older versions, the dummy item may be removed from the array.
Cc: "Paul E. McKenney" <[email protected]>
Suggested-by: "Paul E. McKenney" <[email protected]>
Signed-off-by: Alexander Gordeev <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
2015-07-17  genirq: Prevent resend to interrupts marked IRQ_NESTED_THREAD  [Thomas Gleixner]  1 file, -5/+13
The resend mechanism happily calls the interrupt handler of interrupts which are marked IRQ_NESTED_THREAD from softirq context. This can result in crashes because the interrupt handler is not the proper way to invoke the device handlers; they must be invoked via handle_nested_irq(). Prevent the resend even if the interrupt has no valid parent irq set. It's better to have a lost interrupt than a crashing machine.
Reported-by: Uwe Kleine-König <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
2015-07-15  rcutorture: Add RCU-tasks qualifier to dereference  [Paul E. McKenney]  1 file, -2/+14
Although RCU-tasks isn't really designed to support rcu_dereference() and list manipulation, that is how rcutorture tests it. This means that lockdep-RCU complains about the rcu_dereference_check() invocations because RCU-tasks doesn't have read-side markers. This commit therefore creates a torturing_tasks() check to silence the lockdep-RCU complaints from rcu_dereference_check() when RCU-tasks is being tortured.
Signed-off-by: Paul E. McKenney <[email protected]>
2015-07-15  rcutorture: Fix rcu_torture_cbflood() for callback-free RCU  [Paul E. McKenney]  1 file, -3/+2
The rcu_torture_cbflood() function correctly checks for flavors of RCU that lack analogs to call_rcu() and rcu_barrier(), but in that case it fails to terminate correctly. In fact, it terminates so incorrectly that segfaults can result. This commit therefore causes rcu_torture_cbflood() to do the proper wait-for-stop procedure. Signed-off-by: Paul E. McKenney <[email protected]>
2015-07-15  rcutorture: Bounds-check rcutorture.shuffle_interval  [Paul E. McKenney]  1 file, -1/+1
Specifying a negative rcutorture.shuffle_interval value will cause a negative value to be used as a sleep time. This commit therefore refuses to start shuffling unless the rcutorture.shuffle_interval value is greater than zero. Signed-off-by: Paul E. McKenney <[email protected]>
2015-07-15  rcutorture: Check nfakewriters parameter  [Paul E. McKenney]  1 file, -6/+9
Currently, a negative value for rcutorture.nfakewriters= can cause rcutorture to pass a negative size to the memory allocator, which is not really a particularly good thing to do. This commit therefore adds bounds checking to this parameter, so that values that are less than or equal to zero disable fake writing. Signed-off-by: Paul E. McKenney <[email protected]>
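As a rough sketch, the check amounts to treating non-positive values as "disabled" before the count ever reaches the allocator; the variable names below mirror rcutorture's, but the snippet itself is illustrative rather than the actual hunk.

    #include <linux/module.h>
    #include <linux/sched.h>
    #include <linux/slab.h>

    static struct task_struct **fakewriter_tasks;
    static int nfakewriters = 4;
    module_param(nfakewriters, int, 0444);

    static int __init fakewriters_alloc(void)
    {
            /* A zero or negative value disables fake writers instead of being
             * handed to kcalloc() as a bogus array size. */
            if (nfakewriters <= 0)
                    return 0;

            fakewriter_tasks = kcalloc(nfakewriters, sizeof(*fakewriter_tasks),
                                       GFP_KERNEL);
            return fakewriter_tasks ? 0 : -ENOMEM;
    }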
2015-07-15  rcutorture: Better bounds checking for n_barrier_cbs  [Paul E. McKenney]  1 file, -1/+1
A negative value for rcutorture.n_barrier_cbs can pass a negative value to the memory allocator, so this commit instead causes rcu_barrier() testing to be disabled in this case. Signed-off-by: Paul E. McKenney <[email protected]>
2015-07-15  rcu: Simplify arithmetic to calculate number of RCU nodes  [Alexander Gordeev]  2 files, -15/+6
This update makes the arithmetic used to calculate the number of RCU nodes more straightforward and easier to read.
Cc: "Paul E. McKenney" <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Alexander Gordeev <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
2015-07-15  rcu: Limit count of static data to the number of RCU levels  [Alexander Gordeev]  2 files, -17/+16
Although the number of RCU levels may be less than the current maximum of four, some static data associated with each level are allocated for all four levels. As a result, the extra data never get accessed and just waste memory. This update limits the count of allocated items to the number of RCU levels actually used.
Cc: "Paul E. McKenney" <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Alexander Gordeev <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
2015-07-15  rcu: Remove unnecessary fields from rcu_state structure  [Alexander Gordeev]  2 files, -14/+15
Members rcu_state::levelcnt[] and rcu_state::levelspread[] are only used at init. There is no reason to keep them afterwards. Cc: "Paul E. McKenney" <[email protected]> Cc: Steven Rostedt <[email protected]> Signed-off-by: Alexander Gordeev <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2015-07-15  rcu: Limit rcu_capacity[] size to RCU_NUM_LVLS items  [Alexander Gordeev]  2 files, -8/+6
The number of items in the rcu_capacity[] array is defined by the macro MAX_RCU_LVLS. However, that array is never accessed beyond index RCU_NUM_LVLS. Therefore, we can limit the array to RCU_NUM_LVLS items and eliminate MAX_RCU_LVLS. As a result, memory is conserved in most cases.
Cc: "Paul E. McKenney" <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Alexander Gordeev <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
2015-07-15  rcu: Limit rcu_state::levelcnt[] to RCU_NUM_LVLS items  [Alexander Gordeev]  1 file, -1/+1
The rcu_num_lvls variable is bounded by the RCU_NUM_LVLS macro, and the rcu_state::levelcnt[] array is never accessed beyond rcu_num_lvls. Thus, rcu_state::levelcnt[] is safe to limit to RCU_NUM_LVLS items. Since rcu_num_lvls can change during boot (as a result of a rcutree.rcu_fanout_leaf kernel parameter update), one might assume the new value could exceed RCU_NUM_LVLS. That is not the case, however, since the leaf-level fanout is only permitted to increase, which means rcu_num_lvls can only decrease.
Cc: "Paul E. McKenney" <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Alexander Gordeev <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
2015-07-15  rcu: Simplify rcu_init_geometry() capacity arithmetics  [Alexander Gordeev]  1 file, -10/+8
The current code suggests that introducing an extra level to the rcu_capacity[] array makes some of the arithmetic easier. In fact, it turns out to be rather confusing and unnecessary.
Cc: "Paul E. McKenney" <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Alexander Gordeev <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
2015-07-15  rcu: Cleanup rcu_init_geometry() code and arithmetics  [Alexander Gordeev]  1 file, -14/+10
This update simplifies the rcu_init_geometry() code flow and makes the calculation of the total number of rcu_node structures easier to read. The update relies on the fact that num_rcu_lvl[] is never accessed beyond index rcu_num_lvls by the rest of the code, so there is no need to initialize the whole array.
Cc: "Paul E. McKenney" <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Alexander Gordeev <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
2015-07-15  rcu: Remove superfluous local variable in rcu_init_geometry()  [Alexander Gordeev]  1 file, -7/+7
The local variable 'n' mirrors 'nr_cpu_ids', and both are used within the same function. There is no reason for 'n' to exist at all.
Cc: "Paul E. McKenney" <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Alexander Gordeev <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
2015-07-15  rcu: Panic if RCU tree can not accommodate all CPUs  [Alexander Gordeev]  1 file, -12/+17
Currently, the condition in which the RCU tree is unable to accommodate the configured number of CPUs is treated as invalid and causes a fallback to compile-time values. However, the code has no means to exceed the RCU tree capacity at either compile time or run time. Therefore, if this condition is ever met at run time, it indicates a serious problem elsewhere and should be handled with a panic.
Cc: "Paul E. McKenney" <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Alexander Gordeev <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
2015-07-15  rcu: Provide more diagnostics for stalled GP kthread  [Paul E. McKenney]  2 files, -3/+13
Signed-off-by: Paul E. McKenney <[email protected]>
2015-07-15  rcu: Change return type to bool  [Nicholas Mc Guire]  1 file, -2/+3
Type-checking coccinelle spatches are being used to locate type mismatches between function signatures and return values. In this case they produced:

    ./kernel/rcu/srcu.c:271 WARNING: return of wrong type int != unsigned long

srcu_readers_active() returns an int that is the sum of per-CPU unsigned longs, but its only user is cleanup_srcu_struct(), which uses it as a boolean (condition) to see whether there are any readers rather than actually using the approximate number of readers. The theoretically possible unsigned long overflow case does not need to be handled explicitly - if we had 4G++ readers then something else went wrong a long time ago.

Proposal: change the return type to boolean. The function name is left unchanged as it fits the naming expectation for a boolean.

The patch was compile-tested for x86_64_defconfig (implies CONFIG_SRCU=y) and is against 4.1-rc5 (localversion-next is -next-20150525).
Signed-off-by: Nicholas Mc Guire <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
2015-07-15  rcu: Deinline rcu_read_lock_sched_held() if DEBUG_LOCK_ALLOC  [Denys Vlasenko]  1 file, -0/+49
DEBUG_LOCK_ALLOC=y is not a production setting, but it is not very unusual either. Many developers routinely use kernels built with it enabled. Apart from being selected by hand, it is also auto-selected by the PROVE_LOCKING "Lock debugging: prove locking correctness" and LOCK_STAT "Lock usage statistics" config options, and LOCK_STAT is necessary for "perf lock" to work. I wouldn't spend too much time optimizing it, but this particular function has a very large cost in code size: when it is deinlined, code size decreases by 830,000 bytes:

        text     data      bss       dec     hex filename
    85674192 22294776 20627456 128596424 7aa39c8 vmlinux.before
    84837612 22294424 20627456 127759492 79d7484 vmlinux

(with this config: http://busybox.net/~vda/kernel_config)
Signed-off-by: Denys Vlasenko <[email protected]>
CC: "Paul E. McKenney" <[email protected]>
CC: Josh Triplett <[email protected]>
CC: Mathieu Desnoyers <[email protected]>
CC: Lai Jiangshan <[email protected]>
CC: Tejun Heo <[email protected]>
CC: Oleg Nesterov <[email protected]>
CC: [email protected]
Reviewed-by: Steven Rostedt <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
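The shape of such a change is the usual config-dependent deinlining pattern; the snippet below is an illustration of that pattern, not the exact rcupdate.h/update.c hunk.

    #ifdef CONFIG_DEBUG_LOCK_ALLOC
    /* The full lockdep-aware version lives out of line in kernel/rcu/update.c. */
    int rcu_read_lock_sched_held(void);
    #else
    /* Without lockdep the check is trivial, so it stays inline. */
    static inline int rcu_read_lock_sched_held(void)
    {
            return !preemptible();
    }
    #endif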
2015-07-15  seccomp: swap hard-coded zeros to defined name  [Kees Cook]  1 file, -1/+1
For clarity: if CONFIG_SECCOMP isn't defined, seccomp_mode() is effectively returning "disabled". This change makes that explicit, along with replacing another hard-coded 0, and results in no operational change.
Signed-off-by: Kees Cook <[email protected]>
2015-07-15  seccomp: add ptrace options for suspend/resume  [Tycho Andersen]  2 files, -0/+21
This patch is the first step in enabling checkpoint/restore of processes with seccomp enabled. One of the things CRIU does while dumping tasks is inject code into them via ptrace to collect information that is only available to the process itself. However, if we are in a seccomp mode where these processes are prohibited from making these syscalls, then what CRIU does kills the task.

This patch adds a new ptrace option, PTRACE_O_SUSPEND_SECCOMP, that enables a task from the init user namespace which has CAP_SYS_ADMIN and no seccomp filters to disable (and re-enable) seccomp filters for another task so that it can be successfully dumped (and restored). We restrict the set of processes that can disable seccomp through ptrace because, although today ptrace can be used to bypass seccomp, there is some discussion of closing this loophole in the future, and we would like this patch not to depend on that behavior and to be future-proofed for when it is removed.

Note that seccomp can be suspended before any filters are actually installed; this behavior is useful on criu restore, so that we can suspend seccomp, restore the filters, unmap our restore code from the restored process' address space, and then resume the task by detaching and have the filters resumed as well.

v2 changes:
 * require that the tracer have no seccomp filters installed
 * drop TIF_NOTSC manipulation from the patch
 * change from a ptrace command to a ptrace option and use this ptrace option as the flag to check. This means that as soon as the tracer detaches/dies, seccomp is re-enabled and, as a corollary, that one can not disable seccomp across PTRACE_ATTACHs.

v3 changes:
 * get rid of various #ifdefs everywhere
 * report more sensible errors when PTRACE_O_SUSPEND_SECCOMP is incorrectly used

v4 changes:
 * get rid of may_suspend_seccomp() in favor of a capable() check in ptrace directly

v5 changes:
 * check that seccomp is not enabled (or suspended) on the tracer

Signed-off-by: Tycho Andersen <[email protected]>
CC: Will Drewry <[email protected]>
CC: Roland McGrath <[email protected]>
CC: Pavel Emelyanov <[email protected]>
CC: Serge E. Hallyn <[email protected]>
Acked-by: Oleg Nesterov <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
[kees: access seccomp.mode through seccomp_mode() instead]
Signed-off-by: Kees Cook <[email protected]>
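A minimal CRIU-style usage sketch follows, assuming a kernel with this patch and a tracer that meets the restrictions above (init user namespace, CAP_SYS_ADMIN, no seccomp filters of its own); error handling is trimmed and the actual dump step is elided.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    #ifndef PTRACE_O_SUSPEND_SECCOMP
    #define PTRACE_O_SUSPEND_SECCOMP (1 << 21)      /* value added by this series */
    #endif

    int main(int argc, char **argv)
    {
            pid_t pid;

            if (argc != 2)
                    return 1;
            pid = atoi(argv[1]);                    /* seccomp-confined tracee */

            if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1) {
                    perror("PTRACE_ATTACH");
                    return 1;
            }
            waitpid(pid, NULL, 0);

            /* Suspend the tracee's filters so injected parasite code can make
             * syscalls that the filters would otherwise forbid. */
            if (ptrace(PTRACE_SETOPTIONS, pid, NULL,
                       (void *)(long)PTRACE_O_SUSPEND_SECCOMP) == -1) {
                    perror("PTRACE_SETOPTIONS");
                    return 1;
            }

            /* ... inject code and dump state here ... */

            /* Detaching clears the option, so seccomp is re-enabled. */
            ptrace(PTRACE_DETACH, pid, NULL, NULL);
            return 0;
    }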
2015-07-15  seccomp: Replace smp_read_barrier_depends() with lockless_dereference()  [Pranith Kumar]  1 file, -4/+3
Recently, lockless_dereference() was added; it can be used in place of hard-coding smp_read_barrier_depends(). The following patch makes that change.
Signed-off-by: Pranith Kumar <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
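The conversion has roughly the following shape (an illustrative fragment, not the exact kernel/seccomp.c hunk); lockless_dereference() bundles the dependent load with the smp_read_barrier_depends() that previously had to be written out by hand.

    /* Before: barrier hard-coded after the load of the filter pointer. */
    struct seccomp_filter *f = READ_ONCE(current->seccomp.filter);
    smp_read_barrier_depends();             /* order the dependent loads that follow */

    /* After: the helper performs the load plus the dependency barrier. */
    struct seccomp_filter *f = lockless_dereference(current->seccomp.filter);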
2015-07-15  Merge tag 'trace-v4.2-rc1-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace  [Linus Torvalds]  2 files, -7/+11
Pull tracing fix from Steven Rostedt:
 "Fengguang Wu discovered a crash that happened to be because of the branch tracer (traces unlikely and likely branches) when enabled with certain debug options. What happened was that various debug options like lockdep and DEBUG_PREEMPT can cause parts of the branch tracer to recurse outside its recursion protection. In fact, part of its recursion protection used these features that caused the lockup. This cleans up the code a little and makes the recursion protection a bit more robust"

* tag 'trace-v4.2-rc1-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Have branch tracer use recursive field of task struct
2015-07-15  genirq: Revert sparse irq locking around __cpu_up() and move it to x86 for now  [Thomas Gleixner]  1 file, -9/+0
Boris reported that the sparse_irq protection around __cpu_up() in the generic code causes a regression on Xen. Xen allocates interrupts and some more in the xen_cpu_up() function, so it deadlocks on the sparse_irq_lock. There is no simple fix for this and we really should have the protection for all architectures, but for now the only solution is to move it to x86 where actual wreckage due to the lack of protection has been observed. Reported-and-tested-by: Boris Ostrovsky <[email protected]> Fixes: a89941816726 'hotplug: Prevent alloc/free of irq descriptors during cpu up/down' Signed-off-by: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: xiao jin <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Yanmin Zhang <[email protected]> Cc: xen-devel <[email protected]>
2015-07-14  cgroup: implement the PIDs subsystem  [Aleksa Sarai]  2 files, -0/+367
Adds a new single-purpose PIDs subsystem to limit the number of tasks that can be forked inside a cgroup. Essentially this is an implementation of RLIMIT_NPROC that applies to a cgroup rather than a process tree. However, it should be noted that organisational operations (adding and removing tasks from a PIDs hierarchy) will *not* be prevented. Rather, the number of tasks in the hierarchy cannot exceed the limit through forking. This is due to the fact that, in the unified hierarchy, attach cannot fail (and it is not possible for a task to overcome its PIDs cgroup policy limit by attaching to a child cgroup -- even if migrating mid-fork it must be able to fork in the parent first). PIDs are fundamentally a global resource, and it is possible to reach PID exhaustion inside a cgroup without hitting any reasonable kmemcg policy. Once you've hit PID exhaustion, you're only in a marginally better state than OOM. This subsystem allows PID exhaustion inside a cgroup to be prevented. Signed-off-by: Aleksa Sarai <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
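For illustration, a small userspace program exercising the new controller might look like this. It assumes the pids controller is mounted at /sys/fs/cgroup/pids (the mount point and the "demo" cgroup name are just for this example) and must run as root on a kernel with the subsystem enabled.

    #include <errno.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void write_str(const char *path, const char *val)
    {
            FILE *f = fopen(path, "w");

            if (!f || fputs(val, f) < 0 || fclose(f) != 0) {
                    perror(path);
                    exit(1);
            }
    }

    int main(void)
    {
            char buf[32];
            pid_t first, second;

            if (mkdir("/sys/fs/cgroup/pids/demo", 0755) && errno != EEXIST) {
                    perror("mkdir");
                    return 1;
            }
            write_str("/sys/fs/cgroup/pids/demo/pids.max", "2");
            snprintf(buf, sizeof(buf), "%d", getpid());
            write_str("/sys/fs/cgroup/pids/demo/cgroup.procs", buf);

            first = fork();                 /* second task: still under the limit */
            if (first == 0) {
                    pause();
                    _exit(0);
            }

            second = fork();                /* third task: exceeds pids.max */
            if (second < 0)
                    printf("fork rejected by pids controller: %s\n", strerror(errno));
            else if (second == 0)
                    _exit(0);

            kill(first, SIGKILL);
            while (wait(NULL) > 0)
                    ;
            return 0;
    }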
2015-07-14  cgroup: allow a cgroup subsystem to reject a fork  [Aleksa Sarai]  4 files, -6/+88
Add a new cgroup subsystem callback can_fork that conditionally states whether or not the fork is accepted or rejected by a cgroup policy. In addition, add a cancel_fork callback so that if an error occurs later in the forking process, any state modified by can_fork can be reverted. Allow for a private opaque pointer to be passed from cgroup_can_fork to cgroup_post_fork, allowing for the fork state to be stored by each subsystem separately. Also add a tagging system for cgroup_subsys.h to allow for CGROUP_<TAG> enumerations to be defined and used. In addition, explicitly add a CGROUP_CANFORK_COUNT macro to make arrays easier to define. This is in preparation for implementing the pids cgroup subsystem.
Signed-off-by: Aleksa Sarai <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
2015-07-14  livepatch: Improve error handling in klp_disable_func()  [Minfei Huang]  1 file, -2/+4
In case func->state or func->old_addr does not have the expected value, we'd rather bail out immediately from klp_disable_func(). This can't really happen with the current codebase, but fix it anyway for the sake of robustness.
[[email protected]: reworded the changelog a bit]
Signed-off-by: Minfei Huang <[email protected]>
Acked-by: Josh Poimboeuf <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>
2015-07-14  PM / autosleep: Use workqueue for user space wakeup sources garbage collector  [SungEun Kim]  1 file, -3/+15
The synchronous synchronize_rcu() in wakeup_source_remove() can sometimes block the user-space process that writes to /sys/kernel/wake_unlock. For example, when the Android eventhub tries to release a wakelock, this blocking can occur, and eventhub can't get input events for a while. Using a work item instead of a direct function call in pm_wake_unlock() can prevent this unnecessary delay from happening.
Signed-off-by: SungEun Kim <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
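The conversion described above follows the standard direct-call-to-work-item pattern; this sketch uses illustrative names rather than the exact kernel/power/wakelock.c symbols.

    #include <linux/workqueue.h>

    void wakelocks_gc(void);                        /* existing synchronous GC routine */

    static void wakelocks_gc_fn(struct work_struct *work)
    {
            /* The slow path, including synchronize_rcu(), now runs from a
             * kworker rather than from the process writing to wake_unlock. */
            wakelocks_gc();
    }
    static DECLARE_WORK(wakelocks_gc_work, wakelocks_gc_fn);

    /* Called from the /sys/kernel/wake_unlock write path. */
    static inline void wakelocks_gc_deferred(void)
    {
            schedule_work(&wakelocks_gc_work);      /* returns immediately */
    }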
2015-07-14  tick: Move the export of tick_broadcast_oneshot_control to the proper place  [Thomas Gleixner]  2 files, -1/+1
tick_broadcast_oneshot_control got moved from tick-broadcast to tick-common, but the export stayed in the old place. Fix it up. Fixes: f32dd1170511 'tick/broadcast: Make idle check independent from mode and config' Reported-by: Ingo Molnar <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]>
2015-07-13  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  [David S. Miller]  19 files, -130/+283
Conflicts:
  net/bridge/br_mdb.c

Minor conflict in br_mdb.c: in 'net' we added a memset of the on-stack 'ip' variable, whereas in 'net-next' we assign a new member 'vid'.
Signed-off-by: David S. Miller <[email protected]>
2015-07-13  ebpf: remove self-assignment in interpreter's tail call  [Daniel Borkmann]  1 file, -1/+5
ARG1 = BPF_R1 as it stands, evaluates to regs[BPF_REG_1] = regs[BPF_REG_1] and thus has no effect. Add a comment instead, explaining what happens and why it's okay to just remove it. Since from user space side, a tail call is invoked as a pseudo helper function via bpf_tail_call_proto, the verifier checks the arguments just like with any other helper function and makes sure that the first argument (regs[BPF_REG_1])'s type is ARG_PTR_TO_CTX. Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-07-12  Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  [Linus Torvalds]  4 files, -70/+148
Pull timer fixes from Thomas Gleixner:
 "This update from the timer department contains:

  - A series of patches which address a shortcoming in the tick broadcast code. If the broadcast device is not available or is an hrtimer-emulated broadcast device, some of the original assumptions lead to boot failures. I rather plugged all of the corner cases instead of only addressing the issue reported, so the change got a little larger. Has been extensively tested on x86 and arm.

  - Get rid of the last holdouts using do_posix_clock_monotonic_gettime()

  - A regression fix for the imx clocksource driver

  - An update to the new state callbacks mechanism for clockevents. This is required to simplify the conversion, which will take place in 4.3"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick/broadcast: Prevent NULL pointer dereference
  time: Get rid of do_posix_clock_monotonic_gettime
  cris: Replace do_posix_clock_monotonic_gettime()
  tick/broadcast: Unbreak CONFIG_GENERIC_CLOCKEVENTS=n build
  tick/broadcast: Handle spurious interrupts gracefully
  tick/broadcast: Check for hrtimer broadcast active early
  tick/broadcast: Return busy when IPI is pending
  tick/broadcast: Return busy if periodic mode and hrtimer broadcast
  tick/broadcast: Move the check for periodic mode inside state handling
  tick/broadcast: Prevent deep idle if no broadcast device available
  tick/broadcast: Make idle check independent from mode and config
  tick/broadcast: Sanity check the shutdown of the local clock_event
  tick/broadcast: Prevent hrtimer recursion
  clockevents: Allow set-state callbacks to be optional
  clocksource/imx: Define clocksource for mx27
2015-07-12  Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  [Linus Torvalds]  2 files, -5/+21
Pull irq fix from Thomas Gleixner:
 "A single fix for a cpu hotplug race vs. interrupt descriptors: Prevent irq setup/teardown across the cpu starting/dying parts of cpu hotplug so that the starting/dying cpu has a stable view of the descriptor space. This has been an issue for all architectures in the cpu dying phase, where interrupts are migrated away from the dying cpu. In the starting phase it's mostly an x86 issue vs the vector space update"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  hotplug: Prevent alloc/free of irq descriptors during cpu up/down
2015-07-11  genirq: Remove the irq argument from setup_affinity()  [Jiang Liu]  1 file, -8/+7
The irq argument is unused except by the alpha wrapper, which can retrieve it from the irq descriptor.
Signed-off-by: Jiang Liu <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Borislav Petkov <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>