aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2013-06-28timekeeping: Indicate that clock was set in the pvclock gtod notifierDavid Vrabel1-12/+18
If the clock was set (stepped), set the action parameter to functions in the pvclock gtod notifier chain to non-zero. This allows the callee to only do work if the clock was stepped. This will be used on Xen as the synchronization of the Xen wallclock to the control domain's (dom0) system time will be done with this notifier and updating on every timer tick is unnecessary and too expensive. Signed-off-by: David Vrabel <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: John Stultz <[email protected]> Cc: <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2013-06-28timekeeping: Pass flags instead of multiple bools to timekeeping_update()David Vrabel1-9/+12
Instead of passing multiple bools to timekeeping_updated(), define flags and use a single 'action' parameter. It is then more obvious what each timekeeping_update() call does. Signed-off-by: David Vrabel <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: John Stultz <[email protected]> Cc: <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2013-06-28hrtimers: Support resuming with two or more CPUs online (but stopped)David Vrabel1-3/+12
hrtimers_resume() only reprograms the timers for the current CPU as it assumes that all other CPUs are offline at this point in the resume process. If other CPUs are online then their timers will not be corrected and they may fire at the wrong time. When running as a Xen guest, this assumption is not true. Non-boot CPUs are only stopped with IRQs disabled instead of offlining them. This is a performance optimization as disabling the CPUs would add an unacceptable amount of additional downtime during a live migration (> 200 ms for a 4 VCPU guest). hrtimers_resume() cannot call on_each_cpu(retrigger_next_event,...) as the other CPUs will be stopped with IRQs disabled. Instead, defer the call to the next softirq. [ tglx: Separated the xen change out ] Signed-off-by: David Vrabel <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: John Stultz <[email protected]> Cc: <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2013-06-28timer: Fix jiffies wrap behavior of round_jiffies_common()Bart Van Assche1-3/+5
Direct compare of jiffies related values does not work in the wrap around case. Replace it with time_is_after_jiffies(). Signed-off-by: Bart Van Assche <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Stephen Rothwell <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Cc: [email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2013-06-28softirq: Use _RET_IP_Davidlohr Bueso1-6/+4
Use the already defined macro to pass the function return address. Signed-off-by: Davidlohr Bueso <[email protected]> Cc: Frederic Weisbecker <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2013-06-28sched/debug: Remove CONFIG_FAIR_GROUP_SCHED maskAlex Shi1-4/+6
Now that we are using runnable load avg in sched balance, we don't need to keep it under CONFIG_FAIR_GROUP_SCHED. Also align the code style to #ifdef instead of #if defined() and reorder the tg output info. Signed-off-by: Alex Shi <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2013-06-28Merge branch 'pm-assorted'Rafael J. Wysocki4-6/+17
* pm-assorted: PM / QoS: Add pm_qos and dev_pm_qos to events-power.txt PM / QoS: Add dev_pm_qos_request tracepoints PM / QoS: Add pm_qos_request tracepoints PM / QoS: Add pm_qos_update_target/flags tracepoints PM / QoS: Update Documentation/power/pm_qos_interface.txt PM / Sleep: Print last wakeup source on failed wakeup_count write PM / QoS: correct the valid range of pm_qos_class PM / wakeup: Adjust messaging for wake events during suspend PM / Runtime: Update .runtime_idle() callback documentation PM / Runtime: Rework the "runtime idle" helper routine PM / Hibernate: print physical addresses consistently with other parts of kernel
2013-06-28Merge branch 'freezer'Rafael J. Wysocki7-24/+41
* freezer: af_unix: use freezable blocking calls in read sigtimedwait: use freezable blocking call nanosleep: use freezable blocking call futex: use freezable blocking call select: use freezable blocking call epoll: use freezable blocking call binder: use freezable blocking calls freezer: add new freezable helpers using freezer_do_not_count() freezer: convert freezable helpers to static inline where possible freezer: convert freezable helpers to freezer_do_not_count() freezer: skip waking up tasks with PF_FREEZER_SKIP set freezer: shorten freezer sleep time using exponential backoff lockdep: check that no locks held at freeze time lockdep: remove task argument from debug_check_no_locks_held freezer: add unsafe versions of freezable helpers for CIFS freezer: add unsafe versions of freezable helpers for NFS
2013-06-28genirq: Add the generic chip to the genirq docbookThomas Gleixner1-5/+6
Signed-off-by: Thomas Gleixner <[email protected]> Cc: Randy Dunlap <[email protected]>
2013-06-28genirq: generic-chip: Export some irq_gc_ functionsFabio Estevam1-0/+3
When building imx_v6_v7_defconfig with imx-drm drivers selected as modules, we get the following build errors: ERROR: "irq_gc_mask_clr_bit" [drivers/staging/imx-drm/ipu-v3/imx-ipu-v3.ko] undefined! ERROR: "irq_gc_mask_set_bit" [drivers/staging/imx-drm/ipu-v3/imx-ipu-v3.ko] undefined! ERROR: "irq_gc_ack_set_bit" [drivers/staging/imx-drm/ipu-v3/imx-ipu-v3.ko] undefined! Export the required functions to avoid this problem. Signed-off-by: Fabio Estevam <[email protected]> Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2013-06-28genirq: Fix can_request_irq() for IRQs without an actionBen Hutchings1-3/+3
Commit 02725e7471b8 ('genirq: Use irq_get/put functions'), inadvertently changed can_request_irq() to return 0 for IRQs that have no action. This causes pcibios_lookup_irq() to select only IRQs that already have an action with IRQF_SHARED set, or to fail if there are none. Change can_request_irq() to return 1 for IRQs that have no action (if the first two conditions are met). Reported-by: Bjarni Ingi Gislason <[email protected]> Tested-by: Bjarni Ingi Gislason <[email protected]> (against 3.2) Signed-off-by: Ben Hutchings <[email protected]> Cc: [email protected] Cc: [email protected] # 2.6.39+ Link: http://bugs.debian.org/709647 Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2013-06-28sched/debug: Fix formatting of /proc/<PID>/schedKamalesh Babulal1-8/+9
This patch alters format string's width, to align all statistics at par with the longest struct sched_statistic member name under /proc/<PID>/sched. Signed-off-by: Kamalesh Babulal <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2013-06-27cgroup: CGRP_ROOT_SUBSYS_BOUND should be ignored when comparing mount optionsTejun Heo1-1/+4
1672d04070 ("cgroup: fix cgroupfs_root early destruction path") introduced CGRP_ROOT_SUBSYS_BOUND which is used to mark completion of subsys binding on a new root; however, this broke remounts. cgroup_remount() doesn't allow changing root options via remount and CGRP_ROOT_SUBSYS_BOUND, which is set on all fully initialized roots, makes the function reject all remounts. Fix it by putting the options part in the lower 16 bits of root->flags and masking the comparions. While at it, make cgroup_remount() emit an error message explaining why it's rejecting a remount request, so that it's less of a mystery. Signed-off-by: Tejun Heo <[email protected]>
2013-06-27cgroup: fix deadlock on cgroup_mutex via drop_parsed_module_refcounts()Tejun Heo1-2/+2
eb178d06332 ("cgroup: grab cgroup_mutex in drop_parsed_module_refcounts()") made drop_parsed_module_refcounts() grab cgroup_mutex to make lockdep assertion in for_each_subsys() happy. Unfortunately, cgroup_remount() calls the function while holding cgroup_mutex in its failure path leading to the following deadlock. # mount -t cgroup -o remount,memory,blkio cgroup blkio cgroup: option changes via remount are deprecated (pid=525 comm=mount) ============================================= [ INFO: possible recursive locking detected ] 3.10.0-rc4-work+ #1 Not tainted --------------------------------------------- mount/525 is trying to acquire lock: (cgroup_mutex){+.+.+.}, at: [<ffffffff8110a3e1>] drop_parsed_module_refcounts+0x21/0xb0 but task is already holding lock: (cgroup_mutex){+.+.+.}, at: [<ffffffff8110e4e1>] cgroup_remount+0x51/0x200 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(cgroup_mutex); lock(cgroup_mutex); *** DEADLOCK *** May be due to missing lock nesting notation 4 locks held by mount/525: #0: (&type->s_umount_key#30){+.+...}, at: [<ffffffff811e9a0d>] do_mount+0x2bd/0xa30 #1: (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [<ffffffff8110e4d3>] cgroup_remount+0x43/0x200 #2: (cgroup_mutex){+.+.+.}, at: [<ffffffff8110e4e1>] cgroup_remount+0x51/0x200 #3: (cgroup_root_mutex){+.+.+.}, at: [<ffffffff8110e4ef>] cgroup_remount+0x5f/0x200 stack backtrace: CPU: 2 PID: 525 Comm: mount Not tainted 3.10.0-rc4-work+ #1 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 ffffffff829651f0 ffff88000ec2fc28 ffffffff81c24bb1 ffff88000ec2fce8 ffffffff810f420d 0000000000000006 0000000000000001 0000000000000056 ffff8800153b4640 ffff880000000000 ffffffff81c2e468 ffff8800153b4640 Call Trace: [<ffffffff81c24bb1>] dump_stack+0x19/0x1b [<ffffffff810f420d>] __lock_acquire+0x15dd/0x1e60 [<ffffffff810f531c>] lock_acquire+0x9c/0x1f0 [<ffffffff81c2a805>] mutex_lock_nested+0x65/0x410 [<ffffffff8110a3e1>] drop_parsed_module_refcounts+0x21/0xb0 [<ffffffff8110e63e>] cgroup_remount+0x1ae/0x200 [<ffffffff811c9bb2>] do_remount_sb+0x82/0x190 [<ffffffff811e9d41>] do_mount+0x5f1/0xa30 [<ffffffff811ea203>] SyS_mount+0x83/0xc0 [<ffffffff81c2fb82>] system_call_fastpath+0x16/0x1b Fix it by moving the drop_parsed_module_refcounts() invocation outside cgroup_mutex. Signed-off-by: Tejun Heo <[email protected]>
2013-06-27PM / Sleep: Warn about system time after resume with pm_traceShuah Khan1-0/+4
pm_trace uses the system's Real Time Clock (RTC) to save the magic number. The reason for this is that the RTC is the only reliably available piece of hardware during resume operations where a value can be set that will survive a reboot. Consequence is that after a resume (even if it is successful) your system clock will have a value corresponding to the magic number instead of the correct date/time! It is therefore advisable to use a program like ntp-date or rdate to reset the correct date/time from an external time source when using this trace option. There is no run-time message to warn users of the consequences of enabling pm_trace. Adding a warning message to pm_trace_store() will serve as a reminder to users to set the system date and time after resume. Signed-off-by: Shuah Khan <[email protected]> Acked-by: Pavel Machek <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2013-06-27sched/fair: Fix typo describing flags in enqueue_entityKamalesh Babulal1-1/+1
Fix spelling of 'calling' in description of se flags in enqueue_entity(). Signed-off-by: Kamalesh Babulal <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2013-06-27sched/debug: Add load-tracking statistics to taskKamalesh Babulal1-0/+6
At present we print per-entity load-tracking statistics for cfs_rq of cgroups/runqueues. Given that per task statistics is maintained, it can be used to know the contribution made by the task to its parenting cfs_rq level. This patch adds per-task load-tracking statistics to /proc/<PID>/sched. Signed-off-by: Kamalesh Babulal <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2013-06-27sched: Change get_rq_runnable_load() to static and inlineAlex Shi1-2/+2
Based-on-patch-by: Fengguang Wu <[email protected]> Signed-off-by: Alex Shi <[email protected]> Tested-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2013-06-27sched/tg: Remove tg.load_weightAlex Shi1-1/+0
Since no one use it. Signed-off-by: Alex Shi <[email protected]> Reviewed-by: Paul Turner <[email protected]> Tested-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2013-06-27sched/cfs_rq: Change atomic64_t removed_load to atomic_long_tAlex Shi2-5/+8
Similar to runnable_load_avg, blocked_load_avg variable, long type is enough for removed_load in 64 bit or 32 bit machine. Then we avoid the expensive atomic64 operations on 32 bit machine. Signed-off-by: Alex Shi <[email protected]> Reviewed-by: Paul Turner <[email protected]> Tested-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2013-06-27sched/tg: Use 'unsigned long' for load variable in task groupAlex Shi3-11/+11
Since tg->load_avg is smaller than tg->load_weight, we don't need a atomic64_t variable for load_avg in 32 bit machine. The same reason for cfs_rq->tg_load_contrib. The atomic_long_t/unsigned long variable type are more efficient and convenience for them. Signed-off-by: Alex Shi <[email protected]> Tested-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2013-06-27sched: Change cfs_rq load avg to unsigned longAlex Shi3-8/+5
Since the 'u64 runnable_load_avg, blocked_load_avg' in cfs_rq struct are smaller than 'unsigned long' cfs_rq->load.weight. We don't need u64 vaiables to describe them. unsigned long is more efficient and convenience. Signed-off-by: Alex Shi <[email protected]> Reviewed-by: Paul Turner <[email protected]> Tested-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2013-06-27sched: Consider runnable load average in move_tasks()Alex Shi1-9/+9
Aside from using runnable load average in background, move_tasks is also the key function in load balance. We need consider the runnable load average in it in order to make it an apple to apple load comparison. Morten had caught a div u64 bug on ARM, thanks! Thanks-to: Morten Rasmussen <[email protected]> Signed-off-by: Alex Shi <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2013-06-27sched: Compute runnable load avg in cpu_load and cpu_avg_load_per_taskAlex Shi2-4/+18
They are the base values in load balance, update them with rq runnable load average, then the load balance will consider runnable load avg naturally. We also try to include the blocked_load_avg as cpu load in balancing, but that cause kbuild performance drop 6% on every Intel machine, and aim7/oltp drop on some of 4 CPU sockets machines. Or only add blocked_load_avg into get_rq_runable_load, hackbench still drop a little on NHM EX. Signed-off-by: Alex Shi <[email protected]> Reviewed-by: Gu Zheng <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2013-06-27sched: Update cpu load after task_tickAlex Shi1-1/+1
To get the latest runnable info, we need do this cpuload update after task_tick. Signed-off-by: Alex Shi <[email protected]> Reviewed-by: Paul Turner <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2013-06-27sched: Fix sleep time double accounting in enqueue entityAlex Shi1-1/+7
The woken migrated task will __synchronize_entity_decay(se); in migrate_task_rq_fair, then it needs to set `se->avg.last_runnable_update -= (-se->avg.decay_count) << 20' before update_entity_load_avg, in order to avoid sleep time is updated twice for se.avg.load_avg_contrib in both __syncchronize and update_entity_load_avg. However if the sleeping task is woken up from the same cpu, it miss the last_runnable_update before update_entity_load_avg(se, 0, 1), then the sleep time was used twice in both functions. So we need to remove the double sleep time accounting. Paul also contributed some code comments in this commit. Signed-off-by: Alex Shi <[email protected]> Reviewed-by: Paul Turner <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2013-06-27sched: Set an initial value of runnable avg for new forked taskAlex Shi3-4/+28
We need to initialize the se.avg.{decay_count, load_avg_contrib} for a new forked task. Otherwise random values of above variables cause a mess when a new task is enqueued: enqueue_task_fair enqueue_entity enqueue_entity_load_avg and make fork balancing imbalance due to incorrect load_avg_contrib. Further more, Morten Rasmussen notice some tasks were not launched at once after created. So Paul and Peter suggest giving a start value for new task runnable avg time same as sched_slice(). PeterZ said: > So the 'problem' is that our running avg is a 'floating' average; ie. it > decays with time. Now we have to guess about the future of our newly > spawned task -- something that is nigh impossible seeing these CPU > vendors keep refusing to implement the crystal ball instruction. > > So there's two asymptotic cases we want to deal well with; 1) the case > where the newly spawned program will be 'nearly' idle for its lifetime; > and 2) the case where its cpu-bound. > > Since we have to guess, we'll go for worst case and assume its > cpu-bound; now we don't want to make the avg so heavy adjusting to the > near-idle case takes forever. We want to be able to quickly adjust and > lower our running avg. > > Now we also don't want to make our avg too light, such that it gets > decremented just for the new task not having had a chance to run yet -- > even if when it would run, it would be more cpu-bound than not. > > So what we do is we make the initial avg of the same duration as that we > guess it takes to run each task on the system at least once -- aka > sched_slice(). > > Of course we can defeat this with wakeup/fork bombs, but in the 'normal' > case it should be good enough. Paul also contributed most of the code comments in this commit. Signed-off-by: Alex Shi <[email protected]> Reviewed-by: Gu Zheng <[email protected]> Reviewed-by: Paul Turner <[email protected]> [peterz; added explanation of sched_slice() usage] Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2013-06-27sched: Move a few runnable tg variables into CONFIG_SMPAlex Shi1-0/+2
The following 2 variables are only used under CONFIG_SMP, so its better to move their definiation into CONFIG_SMP too. atomic64_t load_avg; atomic_t runnable_avg; Signed-off-by: Alex Shi <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2013-06-27Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for ↵Alex Shi3-36/+7
load-tracking" Remove CONFIG_FAIR_GROUP_SCHED that covers the runnable info, then we can use runnable load variables. Also remove 2 CONFIG_FAIR_GROUP_SCHED setting which is not in reverted patch(introduced in 9ee474f), but also need to revert. Signed-off-by: Alex Shi <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2013-06-26Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Three small fixlets" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: hw_breakpoint: Use cpu_possible_mask in {reserve,release}_bp_slot() hw_breakpoint: Fix cpu check in task_bp_pinned(cpu) kprobes: Fix arch_prepare_kprobe to handle copy insn failures
2013-06-26cgroup: always use RCU accessors for protected accessesTejun Heo1-9/+12
kernel/cgroup.c still has places where a RCU pointer is set and accessed directly without going through RCU_INIT_POINTER() or rcu_dereference_protected(). They're all properly protected accesses so nothing is broken but it leads to spurious sparse RCU address space warnings. Substitute direct accesses with RCU_INIT_POINTER() and rcu_dereference_protected(). Note that %true is specified as the extra condition for all derference updates. This isn't ideal as all it does is suppressing warning without actually policing synchronization rules; however, most are scheduled to be removed pretty soon along with css_id itself, so no reason to be more elaborate. Combined with the previous changes, this removes all RCU related sparse warnings from cgroup. Signed-off-by: Tejun Heo <[email protected]> Reported-by: Fengguang Wu <[email protected]> Acked-by; Li Zefan <[email protected]>
2013-06-26cgroup: fix RCU accesses around task->cgroupsTejun Heo1-11/+13
There are several places in kernel/cgroup.c where task->cgroups is accessed and modified without going through proper RCU accessors. None is broken as they're all lock protected accesses; however, this still triggers sparse RCU address space warnings. * Consistently use task_css_set() for task->cgroups dereferencing. * Use RCU_INIT_POINTER() to clear task->cgroups to &init_css_set on exit. * Remove unnecessary rcu_dereference_raw() from cset->subsys[] dereference in cgroup_exit(). Signed-off-by: Tejun Heo <[email protected]> Reported-by: Fengguang Wu <[email protected]> Acked-by: Li Zefan <[email protected]>
2013-06-26cgroup: grab cgroup_mutex in drop_parsed_module_refcounts()Tejun Heo1-5/+5
This isn't strictly necessary as all subsystems specified in @subsys_mask are guaranteed to be pinned; however, it does spuriously trigger lockdep warning. Let's grab cgroup_mutex around it. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2013-06-26cgroup: fix cgroupfs_root early destruction pathTejun Heo1-3/+19
cgroupfs_root used to have ->actual_subsys_mask in addition to ->subsys_mask. a8a648c4ac ("cgroup: remove cgroup->actual_subsys_mask") removed it noting that the subsys_mask is essentially temporary and doesn't belong in cgroupfs_root; however, the patch made it impossible to tell whether a cgroupfs_root actually has the subsystems bound or just have the bits set leading to the following BUG when trying to mount with subsystems which are already mounted elsewhere. kernel BUG at kernel/cgroup.c:1038! invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC ... CPU: 1 PID: 7973 Comm: mount Tainted: G W 3.10.0-rc7-next-20130625-sasha-00011-g1c1dc0e #1105 task: ffff880fc0ae8000 ti: ffff880fc0b9a000 task.ti: ffff880fc0b9a000 RIP: 0010:[<ffffffff81249b29>] [<ffffffff81249b29>] rebind_subsystems+0x409/0x5f0 ... Call Trace: [<ffffffff8124bd4f>] cgroup_kill_sb+0xff/0x210 [<ffffffff813d21af>] deactivate_locked_super+0x4f/0x90 [<ffffffff8124f3b3>] cgroup_mount+0x673/0x6e0 [<ffffffff81257169>] cpuset_mount+0xd9/0x110 [<ffffffff813d2580>] mount_fs+0xb0/0x2d0 [<ffffffff81404afd>] vfs_kern_mount+0xbd/0x180 [<ffffffff814070b5>] do_new_mount+0x145/0x2c0 [<ffffffff814085d6>] do_mount+0x356/0x3c0 [<ffffffff8140873d>] SyS_mount+0xfd/0x140 [<ffffffff854eb600>] tracesys+0xdd/0xe2 We still want rebind_subsystems() to take added/removed masks, so let's fix it by marking whether a cgroupfs_root has finished binding or not. Also, document what's going on around ->subsys_mask initialization so that similar mistakes aren't repeated. Signed-off-by: Tejun Heo <[email protected]> Reported-by: Sasha Levin <[email protected]> Acked-by: Li Zefan <[email protected]>
2013-06-26mutex: Add w/w mutex slowpath debuggingDaniel Vetter1-3/+41
Injects EDEADLK conditions at pseudo-random interval, with exponential backoff up to UINT_MAX (to ensure that every lock operation still completes in a reasonable time). This way we can test the wound slowpath even for ww mutex users where contention is never expected, and the ww deadlock avoidance algorithm is only needed for correctness against malicious userspace. An example would be protecting kernel modesetting properties, which thanks to single-threaded X isn't really expected to contend, ever. I've looked into using the CONFIG_FAULT_INJECTION infrastructure, but decided against it for two reasons: - EDEADLK handling is mandatory for ww mutex users and should never affect the outcome of a syscall. This is in contrast to -ENOMEM injection. So fine configurability isn't required. - The fault injection framework only allows to set a simple probability for failure. Now the probability that a ww mutex acquire stage with N locks will never complete (due to too many injected EDEADLK backoffs) is zero. But the expected number of ww_mutex_lock operations for the completely uncontended case would be O(exp(N)). The per-acuiqire ctx exponential backoff solution choosen here only results in O(log N) overhead due to injection and so O(log N * N) lock operations. This way we can fail with high probability (and so have good test coverage even for fancy backoff and lock acquisition paths) without running into patalogical cases. Note that EDEADLK will only ever be injected when we managed to acquire the lock. This prevents any behaviour changes for users which rely on the EALREADY semantics. Signed-off-by: Daniel Vetter <[email protected]> Signed-off-by: Maarten Lankhorst <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: Linus Torvalds <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/20130620113117.4001.21681.stgit@patser Signed-off-by: Ingo Molnar <[email protected]>
2013-06-26mutex: Add support for wound/wait style locksMaarten Lankhorst1-16/+302
Wound/wait mutexes are used when other multiple lock acquisitions of a similar type can be done in an arbitrary order. The deadlock handling used here is called wait/wound in the RDBMS literature: The older tasks waits until it can acquire the contended lock. The younger tasks needs to back off and drop all the locks it is currently holding, i.e. the younger task is wounded. For full documentation please read Documentation/ww-mutex-design.txt. References: https://lwn.net/Articles/548909/ Signed-off-by: Maarten Lankhorst <[email protected]> Acked-by: Daniel Vetter <[email protected]> Acked-by: Rob Clark <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: Linus Torvalds <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2013-06-26arch: Make __mutex_fastpath_lock_retval return whether fastpath succeeded or notMaarten Lankhorst1-18/+14
This will allow me to call functions that have multiple arguments if fastpath fails. This is required to support ticket mutexes, because they need to be able to pass an extra argument to the fail function. Originally I duplicated the functions, by adding __mutex_fastpath_lock_retval_arg. This ended up being just a duplication of the existing function, so a way to test if fastpath was called ended up being better. This also cleaned up the reservation mutex patch some by being able to call an atomic_set instead of atomic_xchg, and making it easier to detect if the wrong unlock function was previously used. Signed-off-by: Maarten Lankhorst <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: Linus Torvalds <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/20130620113105.4001.83929.stgit@patser Signed-off-by: Ingo Molnar <[email protected]>
2013-06-26Merge tag 'please-pull-mce-bitmap-comment' of ↵Ingo Molnar10-106/+275
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras into x86/ras Pull MCE updates from Tony Luck: "Better comments so we understand our existing machine check bank bitmaps - prelude to adding another bitmap soon." Signed-off-by: Ingo Molnar <[email protected]>
2013-06-25futex: Use freezable blocking callColin Cross1-1/+2
Avoid waking up every thread sleeping in a futex_wait call during suspend and resume by calling a freezable blocking call. Previous patches modified the freezer to avoid sending wakeups to threads that are blocked in freezable blocking calls. This call was selected to be converted to a freezable call because it doesn't hold any locks or release any resources when interrupted that might be needed by another freezing task or a kernel driver during suspend, and is a common site where idle userspace tasks are blocked. Signed-off-by: Colin Cross <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: [email protected] Cc: Tejun Heo <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Darren Hart <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Al Viro <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2013-06-25futex: Take hugepages into account when generating futex_keyZhang Yi1-1/+2
The futex_keys of process shared futexes are generated from the page offset, the mapping host and the mapping index of the futex user space address. This should result in an unique identifier for each futex. Though this is not true when futexes are located in different subpages of an hugepage. The reason is, that the mapping index for all those futexes evaluates to the index of the base page of the hugetlbfs mapping. So a futex at offset 0 of the hugepage mapping and another one at offset PAGE_SIZE of the same hugepage mapping have identical futex_keys. This happens because the futex code blindly uses page->index. Steps to reproduce the bug: 1. Map a file from hugetlbfs. Initialize pthread_mutex1 at offset 0 and pthread_mutex2 at offset PAGE_SIZE of the hugetlbfs mapping. The mutexes must be initialized as PTHREAD_PROCESS_SHARED because PTHREAD_PROCESS_PRIVATE mutexes are not affected by this issue as their keys solely depend on the user space address. 2. Lock mutex1 and mutex2 3. Create thread1 and in the thread function lock mutex1, which results in thread1 blocking on the locked mutex1. 4. Create thread2 and in the thread function lock mutex2, which results in thread2 blocking on the locked mutex2. 5. Unlock mutex2. Despite the fact that mutex2 got unlocked, thread2 still blocks on mutex2 because the futex_key points to mutex1. To solve this issue we need to take the normal page index of the page which contains the futex into account, if the futex is in an hugetlbfs mapping. In other words, we calculate the normal page mapping index of the subpage in the hugetlbfs mapping. Mappings which are not based on hugetlbfs are not affected and still use page->index. Thanks to Mel Gorman who provided a patch for adding proper evaluation functions to the hugetlbfs code to avoid exposing hugetlbfs specific details to the futex code. [ tglx: Massaged changelog ] Signed-off-by: Zhang Yi <[email protected]> Reviewed-by: Jiang Biao <[email protected]> Tested-by: Ma Chenggong <[email protected]> Reviewed-by: 'Mel Gorman' <[email protected]> Acked-by: 'Darren Hart' <[email protected]> Cc: 'Peter Zijlstra' <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/000101ce71a6%24a83c5880%24f8b50980%24@com Signed-off-by: Thomas Gleixner <[email protected]>
2013-06-25cgroup: reserve ID 0 for dummy_root and 1 for unified hierarchyTejun Heo1-4/+6
Before 1a57423166 ("cgroup: make hierarchy_id use cyclic idr"), hierarchy IDs were allocated from 0. As the dummy hierarchy was always the one first initialized, it got assigned 0 and all other hierarchies from 1. The patch accidentally changed the minimum useable ID to 2. Let's restore ID 0 for dummy_root and while at it reserve 1 for unified hierarchy. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]> Cc: [email protected]
2013-06-25cgroup: implement for_each_[builtin_]subsys()Tejun Heo1-76/+71
There are quite a few places where all loaded [builtin] subsys are iterated. Implement for_each_[builtin_]subsys() and replace manual iterations with those to simplify those places a bit. The new iterators automatically skip NULL subsystems. This shouldn't cause any functional difference. Iteration loops which scan all subsystems and then skipping modular ones explicitly are converted to use for_each_builtin_subsys(). While at it, reorder variable declarations and adjust whitespaces a bit in the affected functions. v2: Add lockdep_assert_held() in for_each_subsys() and add comments about synchronization as suggested by Li. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2013-06-25cgroup: move init_css_set initialization inside cgroup_mutexTejun Heo1-4/+4
cgroup_init() was doing init_css_set initialization outside cgroup_mutex, which is fine but we want to add lockdep annotation on subsystem iterations and cgroup_init() will trigger it spuriously. Move init_css_set initialization inside cgroup_mutex. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2013-06-25irqdomain: Use irq_get_trigger_type() to get IRQ flagsJavier Martinez Canillas1-1/+1
Use irq_get_trigger_type() to get the IRQ trigger type flags instead calling irqd_get_trigger_type(irq_desc_get_irq_data(virq)) Signed-off-by: Javier Martinez Canillas <[email protected]> Acked-by: Grant Likely <[email protected]> Cc: Linus Walleij <[email protected]> Cc: Samuel Ortiz <[email protected]> Cc: Jason Cooper <[email protected]> Cc: Andrew Lunn <[email protected]> Cc: Russell King <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/1371228049-27080-8-git-send-email-javier.martinez@collabora.co.uk Signed-off-by: Thomas Gleixner <[email protected]>
2013-06-24cgroup: s/for_each_subsys()/for_each_root_subsys()/Tejun Heo1-25/+22
for_each_subsys() walks over subsystems attached to a hierarchy and we're gonna add iterators which walk over all available subsystems. Rename for_each_subsys() to for_each_root_subsys() so that it's more appropriately named and for_each_subsys() can be used to iterate all subsystems. While at it, remove unnecessary underbar prefix from macro arguments, put them inside parentheses, and adjust indentation for the two for_each_*() macros. This patch is purely cosmetic. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2013-06-24cgroup: clean up find_css_set() and friendsTejun Heo1-21/+17
find_css_set() passes uninitialized on-stack template[] array to find_existing_css_set() which sets the entries for all subsystems. Passing around an uninitialized array is a bit icky and we want to introduce an iterator which only iterates loaded subsystems. Let's initialize it on definition. While at it, also make the following cosmetic cleanups. * Convert to proper /** comments. * Reorder variable declarations. * Replace comment on synchronization with lockdep_assert_held(). This patch doesn't make any functional differences. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2013-06-24cgroup: remove cgroup->actual_subsys_maskTejun Heo1-10/+12
cgroup curiously has two subsystem masks, ->subsys_mask and ->actual_subsys_mask. The latter only exists because the new target subsys_mask is passed into rebind_subsystems() via @root>subsys_mask. rebind_subsystems() needs to know what the current mask is to decide how to reach the target mask so ->actual_subsys_mask is used as the temp location to remember the current state. Adding a temporary field to a permanent data structure is rather silly and can be misleading. Update rebind_subsystems() to take @added_mask and @removed_mask instead and remove @root->actual_subsys_mask. This patch shouldn't introduce any behavior changes. v2: Comment and description updated as suggested by Li. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2013-06-24cgroup: prefix global variables with "cgroup_"Tejun Heo1-76/+77
Global variable names in kernel/cgroup.c are asking for trouble - subsys, roots, rootnode and so on. Rename them to have "cgroup_" prefix. * s/subsys/cgroup_subsys/ * s/rootnode/cgroup_dummy_root/ * s/dummytop/cgroup_cummy_top/ * s/roots/cgroup_roots/ * s/root_count/cgroup_root_count/ This patch is purely cosmetic. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2013-06-24Merge 3.10-rc7 into driver-core-nextGreg Kroah-Hartman10-106/+275
We want the firmware merge fixes, and other bits, in here now. Signed-off-by: Greg Kroah-Hartman <[email protected]>
2013-06-24clockevents: Prefer CPU local devices over global devicesStephen Boyd1-2/+7
On an SMP system with only one global clockevent and a dummy clockevent per CPU we run into problems. We want the dummy clockevents to be registered as the per CPU tick devices, but we can only achieve that if we register the dummy clockevents before the global clockevent or if we artificially inflate the rating of the dummy clockevents to be higher than the rating of the global clockevent. Failure to do so leads to boot hangs when the dummy timers are registered on all other CPUs besides the CPU that accepted the global clockevent as its tick device and there is no broadcast timer to poke the dummy devices. If we're registering multiple clockevents and one clockevent is global and the other is local to a particular CPU we should choose to use the local clockevent regardless of the rating of the device. This way, if the clockevent is a dummy it will take the tick device duty as long as there isn't a higher rated tick device and any global clockevent will be bumped out into broadcast mode, fixing the problem described above. Reported-and-tested-by: Mark Rutland <[email protected]> Signed-off-by: Stephen Boyd <[email protected]> Tested-by: [email protected] Cc: John Stultz <[email protected]> Cc: Daniel Lezcano <[email protected]> Cc: [email protected] Cc: John Stultz <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>