aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2016-03-02sched-clock: Migrate to use new tick dependency mask modelFrederic Weisbecker2-19/+5
Instead of checking sched_clock_stable from the nohz subsystem to verify its tick dependency, migrate it to the new mask in order to include it to the all-in-one check. Reviewed-by: Chris Metcalf <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Luiz Capitulino <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Viresh Kumar <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]>
2016-03-02posix-cpu-timers: Migrate to use new tick dependency mask modelFrederic Weisbecker2-47/+12
Instead of providing asynchronous checks for the nohz subsystem to verify posix cpu timers tick dependency, migrate the latter to the new mask. In order to keep track of the running timers and expose the tick dependency accordingly, we must probe the timers queuing and dequeuing on threads and process lists. Unfortunately it implies both task and signal level dependencies. We should be able to further optimize this and merge all that on the task level dependency, at the cost of a bit of complexity and may be overhead. Reviewed-by: Chris Metcalf <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Luiz Capitulino <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Viresh Kumar <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]>
2016-03-02sched: Migrate sched to use new tick dependency mask modelFrederic Weisbecker3-34/+53
Instead of providing asynchronous checks for the nohz subsystem to verify sched tick dependency, migrate sched to the new mask. Everytime a task is enqueued or dequeued, we evaluate the state of the tick dependency on top of the policy of the tasks in the runqueue, by order of priority: SCHED_DEADLINE: Need the tick in order to periodically check for runtime SCHED_FIFO : Don't need the tick (no round-robin) SCHED_RR : Need the tick if more than 1 task of the same priority for round robin (simplified with checking if more than one SCHED_RR task no matter what priority). SCHED_NORMAL : Need the tick if more than 1 task for round-robin. We could optimize that further with one flag per sched policy on the tick dependency mask and perform only the checks relevant to the policy concerned by an enqueue/dequeue operation. Since the checks aren't based on the current task anymore, we could get rid of the task switch hook but it's still needed for posix cpu timers. Reviewed-by: Chris Metcalf <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Luiz Capitulino <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Viresh Kumar <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]>
2016-03-02sched: Account rr tasksFrederic Weisbecker2-0/+17
In order to evaluate the scheduler tick dependency without probing context switches, we need to know how much SCHED_RR and SCHED_FIFO tasks are enqueued as those policies don't have the same preemption requirements. To prepare for that, let's account SCHED_RR tasks, we'll be able to deduce SCHED_FIFO tasks as well from it and the total RT tasks in the runqueue. Reviewed-by: Chris Metcalf <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Luiz Capitulino <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Viresh Kumar <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]>
2016-03-02perf: Migrate perf to use new tick dependency mask modelFrederic Weisbecker2-24/+49
Instead of providing asynchronous checks for the nohz subsystem to verify perf event tick dependency, migrate perf to the new mask. Perf needs the tick for two situations: 1) Freq events. We could set the tick dependency when those are installed on a CPU context. But setting a global dependency on top of the global freq events accounting is much easier. If people want that to be optimized, we can still refine that on the per-CPU tick dependency level. This patch dooesn't change the current behaviour anyway. 2) Throttled events: this is a per-cpu dependency. Reviewed-by: Chris Metcalf <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Luiz Capitulino <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Viresh Kumar <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]>
2016-03-02nohz: Use enum code for tick stop failure tracing messageFrederic Weisbecker1-9/+9
It makes nohz tracing more lightweight, standard and easier to parse. Examples: user_loop-2904 [007] d..1 517.701126: tick_stop: success=1 dependency=NONE user_loop-2904 [007] dn.1 518.021181: tick_stop: success=0 dependency=SCHED posix_timers-6142 [007] d..1 1739.027400: tick_stop: success=0 dependency=POSIX_TIMER user_loop-5463 [007] dN.1 1185.931939: tick_stop: success=0 dependency=PERF_EVENTS Suggested-by: Peter Zijlstra <[email protected]> Reviewed-by: Chris Metcalf <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Luiz Capitulino <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Viresh Kumar <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]>
2016-03-02nohz: New tick dependency maskFrederic Weisbecker2-7/+144
The tick dependency is evaluated on every IRQ and context switch. This consists is a batch of checks which determine whether it is safe to stop the tick or not. These checks are often split in many details: posix cpu timers, scheduler, sched clock, perf events.... each of which are made of smaller details: posix cpu timer involves checking process wide timers then thread wide timers. Perf involves checking freq events then more per cpu details. Checking these informations asynchronously every time we update the full dynticks state bring avoidable overhead and a messy layout. Let's introduce instead tick dependency masks: one for system wide dependency (unstable sched clock, freq based perf events), one for CPU wide dependency (sched, throttling perf events), and task/signal level dependencies (posix cpu timers). The subsystems are responsible for setting and clearing their dependency through a set of APIs that will take care of concurrent dependency mask modifications and kick targets to restart the relevant CPU tick whenever needed. This new dependency engine stays beside the old one until all subsystems having a tick dependency are converted to it. Suggested-by: Thomas Gleixner <[email protected]> Suggested-by: Peter Zijlstra <[email protected]> Reviewed-by: Chris Metcalf <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Luiz Capitulino <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Viresh Kumar <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]>
2016-03-02sched/core: Get rid of 'cpu' argument in wq_worker_sleeping()Alexander Gordeev3-5/+4
Given that wq_worker_sleeping() could only be called for a CPU it is running on, we do not need passing a CPU ID as an argument. Suggested-by: Oleg Nesterov <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Alexander Gordeev <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2016-03-01rcu: Make CPU_DYING_IDLE an explicit callThomas Gleixner3-35/+38
Make the RCU CPU_DYING_IDLE callback an explicit function call, so it gets invoked at the proper place. Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: Rik van Riel <[email protected]> Cc: Rafael Wysocki <[email protected]> Cc: "Srivatsa S. Bhat" <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Paul McKenney <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul Turner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-03-01cpu/hotplug: Make wait for dead cpu completion basedThomas Gleixner2-8/+13
Kill the busy spinning on the control side and just wait for the hotplugged cpu to tell that it reached the dead state. Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: Rik van Riel <[email protected]> Cc: Rafael Wysocki <[email protected]> Cc: "Srivatsa S. Bhat" <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Paul McKenney <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul Turner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-03-01cpu/hotplug: Let upcoming cpu bring itself fully upThomas Gleixner2-29/+39
Let the upcoming cpu kick the hotplug thread and let itself complete the bringup. That way the controll side can just wait for the completion or later when we made the hotplug machinery async not care at all. Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: Rik van Riel <[email protected]> Cc: Rafael Wysocki <[email protected]> Cc: "Srivatsa S. Bhat" <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Paul McKenney <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul Turner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-03-01cpu/hotplug: Move online calls to hotplugged cpuThomas Gleixner1-48/+96
Let the hotplugged cpu invoke the setup/teardown callbacks (CPU_ONLINE/CPU_DOWN_PREPARE) itself. Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: Rik van Riel <[email protected]> Cc: Rafael Wysocki <[email protected]> Cc: "Srivatsa S. Bhat" <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Paul McKenney <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul Turner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-03-01cpu/hotplug: Create hotplug threadsThomas Gleixner3-1/+147
In order to let the hotplugged cpu take care of the setup/teardown, we need a seperate hotplug thread. Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: Rik van Riel <[email protected]> Cc: Rafael Wysocki <[email protected]> Cc: "Srivatsa S. Bhat" <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Paul McKenney <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul Turner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-03-01cpu/hotplug: Split out the state walk into functionsThomas Gleixner1-43/+68
We need that for running callbacks on the AP and the BP. Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: Rik van Riel <[email protected]> Cc: Rafael Wysocki <[email protected]> Cc: "Srivatsa S. Bhat" <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Paul McKenney <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul Turner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-03-01cpu/hotplug: Unpark smpboot threads from the state machineThomas Gleixner3-38/+11
Handle the smpboot threads in the state machine. Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: Rik van Riel <[email protected]> Cc: Rafael Wysocki <[email protected]> Cc: "Srivatsa S. Bhat" <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Paul McKenney <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul Turner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-03-01cpu/hotplug: Move scheduler cpu_online notifier to hotplug coreThomas Gleixner2-10/+18
Move the scheduler cpu online notifier part to the hotplug core. This is anyway the highest priority callback and we need that functionality right now for the next changes. Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: Rik van Riel <[email protected]> Cc: Rafael Wysocki <[email protected]> Cc: "Srivatsa S. Bhat" <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Paul McKenney <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul Turner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-03-01cpu/hotplug: Implement setup/removal interfaceThomas Gleixner1-0/+224
Implement function which allow to setup/remove hotplug state callbacks. The default behaviour for setup is to call the startup function for this state for (or on) all cpus which have a hotplug state >= the installed state. The default behaviour for removal is to call the teardown function for this state for (or on) all cpus which have a hotplug state >= the installed state. This includes rollback to the previous state in case of failure. A special state is CPUHP_ONLINE_DYN. Its for dynamically registering a hotplug callback pair. This is for drivers which have no dependencies to avoid that we need to allocate CPUHP states for each of them For both setup and remove helper functions are provided, which prevent the core to issue the callbacks. This simplifies the conversion of existing hotplug notifiers. [ Dynamic registering implemented by Sebastian Siewior ] Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: Rik van Riel <[email protected]> Cc: Rafael Wysocki <[email protected]> Cc: "Srivatsa S. Bhat" <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Paul McKenney <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul Turner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-03-01cpu/hotplug: Make target state writeableThomas Gleixner1-8/+65
Make it possible to write a target state to the per cpu state file, so we can switch between states. Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: Rik van Riel <[email protected]> Cc: Rafael Wysocki <[email protected]> Cc: "Srivatsa S. Bhat" <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Paul McKenney <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul Turner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-03-01cpu/hotplug: Add sysfs state interfaceThomas Gleixner1-0/+100
Add a sysfs interface so we can actually see in which state the cpus are in. Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: Rik van Riel <[email protected]> Cc: Rafael Wysocki <[email protected]> Cc: "Srivatsa S. Bhat" <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Paul McKenney <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul Turner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-03-01cpu/hotplug: Hand in target state to _cpu_up/downThomas Gleixner1-11/+20
We want to be able to bringup/teardown the cpu to a particular state. Add a target argument to _cpu_up/down. Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: Rik van Riel <[email protected]> Cc: Rafael Wysocki <[email protected]> Cc: "Srivatsa S. Bhat" <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Paul McKenney <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul Turner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-03-01cpu/hotplug: Convert the hotplugged cpu work to a state machineThomas Gleixner1-15/+66
Move the functions which need to run on the hotplugged processor into a state machine array and let the code iterate through these functions. In a later state, this will grow synchronization points between the control processor and the hotplugged processor, so we can move the various architecture implementations of the synchronizations to the core. Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: Rik van Riel <[email protected]> Cc: Rafael Wysocki <[email protected]> Cc: "Srivatsa S. Bhat" <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Paul McKenney <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul Turner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-03-01cpu/hotplug: Convert to a state machine for the control processorThomas Gleixner1-26/+176
Move the split out steps into a callback array and let the cpu_up/down code iterate through the array functions. For now most of the callbacks are asymmetric to resemble the current hotplug maze. Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: Rik van Riel <[email protected]> Cc: Rafael Wysocki <[email protected]> Cc: "Srivatsa S. Bhat" <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Paul McKenney <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul Turner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-03-01cpu/hotplug: Split out cpu down functionsThomas Gleixner1-30/+53
Split cpu_down in separate functions in preparation for state machine conversion. Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: Rik van Riel <[email protected]> Cc: Rafael Wysocki <[email protected]> Cc: "Srivatsa S. Bhat" <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Paul McKenney <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul Turner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-03-01cpu/hotplug: Restructure cpu_up codeThomas Gleixner1-22/+47
Split out into separate functions, so we can convert it to a state machine. Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: Rik van Riel <[email protected]> Cc: Rafael Wysocki <[email protected]> Cc: "Srivatsa S. Bhat" <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Paul McKenney <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul Turner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-03-01cpu/hotplug: Restructure FROZEN state handlingThomas Gleixner1-40/+29
There are only a few callbacks which really care about FROZEN vs. !FROZEN. No need to have extra states for this. Publish the frozen state in an extra variable which is updated under the hotplug lock and let the users interested deal with it w/o imposing that extra state checks on everyone. Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: Rik van Riel <[email protected]> Cc: Rafael Wysocki <[email protected]> Cc: "Srivatsa S. Bhat" <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Paul McKenney <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul Turner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-03-01cgroup: reset css on destructionVladimir Davydov1-0/+3
An associated css can be around for quite a while after a cgroup directory has been removed. In general, it makes sense to reset it to defaults so as not to worry about any remnants. For instance, memory cgroup needs to reset memory.low, otherwise pages charged to a dead cgroup might never get reclaimed. There's ->css_reset callback, which would fit perfectly for the purpose. Currently, it's only called when a subsystem is disabled in the unified hierarchy and there are other subsystems dependant on it. Let's call it on css destruction as well. Suggested-by: Johannes Weiner <[email protected]> Signed-off-by: Vladimir Davydov <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2016-03-01MODSIGN: linux/string.h should be #included to get memcpy()David Howells1-0/+1
linux/string.h should be #included in module_signing.c to get memcpy(), lest the following occur: kernel/module_signing.c: In function 'mod_verify_sig': kernel/module_signing.c:57:2: error: implicit declaration of function 'memcpy' [-Werror=implicit-function-declaration] memcpy(&ms, mod + (modlen - sizeof(ms)), sizeof(ms)); ^ Reported-by: kbuild test robot <[email protected]> Signed-off-by: David Howells <[email protected]>
2016-02-29cgroup: fix and restructure error handling in copy_cgroup_ns()Tejun Heo1-13/+5
copy_cgroup_ns()'s error handling was broken and the attempt to fix it d22025570e2e ("cgroup: fix alloc_cgroup_ns() error handling in copy_cgroup_ns()") was broken too in that it ended up trying an ERR_PTR() value. There's only one place where copy_cgroup_ns() needs to perform cleanup after failure. Simplify and fix the error handling by removing the goto's. Signed-off-by: Tejun Heo <[email protected]> Reported-by: Dan Carpenter <[email protected]> Acked-by: Serge E. Hallyn <[email protected]>
2016-02-29tracing/syscalls: Rename "/format" tracepoint field name "nr" to "__syscall_nr:Taeung Song1-7/+9
Some tracepoint have multiple fields with the same name, "nr", the first one is a unique syscall ID, the other is a syscall argument: # cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_io_getevents/format name: sys_enter_io_getevents ID: 747 format: field:unsigned short common_type; offset:0; size:2; signed:0; field:unsigned char common_flags; offset:2; size:1; signed:0; field:unsigned char common_preempt_count; offset:3; size:1; signed:0; field:int common_pid; offset:4; size:4; signed:1; field:int nr; offset:8; size:4; signed:1; field:aio_context_t ctx_id; offset:16; size:8; signed:0; field:long min_nr; offset:24; size:8; signed:0; field:long nr; offset:32; size:8; signed:0; field:struct io_event * events; offset:40; size:8; signed:0; field:struct timespec * timeout; offset:48; size:8; signed:0; print fmt: "ctx_id: 0x%08lx, min_nr: 0x%08lx, nr: 0x%08lx, events: 0x%08lx, timeout: 0x%08lx", ((unsigned long)(REC->ctx_id)), ((unsigned long)(REC->min_nr)), ((unsigned long)(REC->nr)), ((unsigned long)(REC->events)), ((unsigned long)(REC->timeout)) # Fix it by renaming the "/format" common tracepoint field "nr" to "__syscall_nr". Signed-off-by: Taeung Song <[email protected]> [ Do not rename the struct member, just the '/format' field name ] Signed-off-by: Steven Rostedt <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2016-02-29Handle ISO 8601 leap seconds and encodings of midnight in mktime64()David Howells1-1/+8
Handle the following ISO 8601 features in mktime64(): (1) Leap seconds. Leap seconds are indicated by the seconds parameter being the value 60. Handle this by treating it the same as 00 of the following minute. It has been pointed out that a minute may contain two leap seconds. However, pending discussion of what that looks like and how to handle it, I'm not going to concern myself with it. (2) Alternate encodings of midnight. Two different encodings of midnight are permitted - 00:00:00 and 24:00:00 - the first is midnight today and the second is midnight tomorrow and is exactly equivalent to the first with tomorrow's date. As it happens, we don't actually need to change mktime64() to handle either of these - just comment them as valid parameters. These facility will be used by the X.509 parser. Doing it in mktime64() makes the policy common to the whole kernel and easier to find. Signed-off-by: David Howells <[email protected]> Acked-by: Arnd Bergmann <[email protected]> cc: John Stultz <[email protected]> cc: Rudolf Polzer <[email protected]> cc: One Thousand Gnomes <[email protected]>
2016-02-29locking/lockdep: Detect chain_key collisionsIngo Molnar1-8/+51
Add detection for chain_key collision under CONFIG_DEBUG_LOCKDEP. When a collision is detected the problem is reported and all lock debugging is turned off. Tested using liblockdep and the added tests before and after applying the fix, confirming both that the code added for the detection correctly reports the problem and that the fix actually fixes it. Tested tweaking lockdep to generate false collisions and verified that the problem is reported and that lock debugging is turned off. Also tested with lockdep's test suite after applying the patch: [ 0.000000] Good, all 253 testcases passed! | Signed-off-by: Alfredo Alvarez Fernandez <[email protected]> Cc: Alfredo Alvarez Fernandez <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/1455864533-7536-4-git-send-email-alfredoalvarezernandez@gmail.com Signed-off-by: Ingo Molnar <[email protected]>
2016-02-29locking/lockdep: Prevent chain_key collisionsAlfredo Alvarez Fernandez1-8/+6
The chain_key hashing macro iterate_chain_key(key1, key2) does not generate a new different value if both key1 and key2 are 0. In that case the generated value is again 0. This can lead to collisions which can result in lockdep not detecting deadlocks or circular dependencies. Avoid the problem by using class_idx (1-based) instead of class id (0-based) as an input for the hashing macro 'key2' in iterate_chain_key(key1, key2). The use of class id created collisions in cases like the following: 1.- Consider an initial state in which no class has been acquired yet. Under these circumstances an AA deadlock will not be detected by lockdep: lock [key1,key2]->new key (key1=old chain_key, key2=id) -------------------------- A [0,0]->0 A [0,0]->0 (collision) The newly generated chain_key collides with the one used before and as a result the check for a deadlock is skipped A simple test using liblockdep and a pthread mutex confirms the problem: (omitting stack traces) new class 0xe15038: 0x7ffc64950f20 acquire class [0xe15038] 0x7ffc64950f20 acquire class [0xe15038] 0x7ffc64950f20 hash chain already cached, key: 0000000000000000 tail class: [0xe15038] 0x7ffc64950f20 2.- Consider an ABBA in 2 different tasks and no class yet acquired. T1 [key1,key2]->new key T2[key1,key2]->new key -- -- A [0,0]->0 B [0,1]->1 B [0,1]->1 (collision) A In this case the collision prevents lockdep from creating the new dependency A->B. This in turn results in lockdep not detecting the circular dependency when T2 acquires A. Signed-off-by: Alfredo Alvarez Fernandez <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/1455147212-2389-4-git-send-email-alfredoalvarezernandez@gmail.com Signed-off-by: Ingo Molnar <[email protected]>
2016-02-29locking/mutex: Allow next waiter lockless wakeupDavidlohr Bueso1-2/+3
Make use of wake-queues and enable the wakeup to occur after releasing the wait_lock. This is similar to what we do with rtmutex top waiter, slightly shortening the critical region and allow other waiters to acquire the wait_lock sooner. In low contention cases it can also help the recently woken waiter to find the wait_lock available (fastpath) when it continues execution. Reviewed-by: Waiman Long <[email protected]> Signed-off-by: Davidlohr Bueso <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Ding Tianhong <[email protected]> Cc: Jason Low <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tim Chen <[email protected]> Cc: Waiman Long <[email protected]> Cc: Will Deacon <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-02-29locking/pvqspinlock: Enable slowpath locking count trackingWaiman Long2-0/+8
This patch enables the tracking of the number of slowpath locking operations performed. This can be used to compare against the number of lock stealing operations to see what percentage of locks are stolen versus acquired via the regular slowpath. Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Douglas Hatch <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Scott J Norton <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-02-29locking/qspinlock: Use smp_cond_acquire() in pending codeWaiman Long1-4/+3
The newly introduced smp_cond_acquire() was used to replace the slowpath lock acquisition loop. Similarly, the new function can also be applied to the pending bit locking loop. This patch uses the new function in that loop. Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Douglas Hatch <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Scott J Norton <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-02-29locking/pvqspinlock: Move lock stealing count tracking code into ↵Waiman Long2-20/+9
pv_queued_spin_steal_lock() This patch moves the lock stealing count tracking code into pv_queued_spin_steal_lock() instead of via a jacket function simplifying the code. Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Douglas Hatch <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Scott J Norton <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-02-29locking/mcs: Fix mcs_spin_lock() orderingPeter Zijlstra1-1/+7
Similar to commit b4b29f94856a ("locking/osq: Fix ordering of node initialisation in osq_lock") the use of xchg_acquire() is fundamentally broken with MCS like constructs. Furthermore, it turns out we rely on the global transitivity of this operation because the unlock path observes the pointer with a READ_ONCE(), not an smp_load_acquire(). This is non-critical because the MCS code isn't actually used and mostly serves as documentation, a stepping stone to the more complex things we've build on top of the idea. Reported-by: Andrea Parri <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Fixes: 3552a07a9c4a ("locking/mcs: Use acquire/release semantics") Signed-off-by: Ingo Molnar <[email protected]>
2016-02-29Merge tag 'v4.5-rc6' into locking/core, to pick up fixesIngo Molnar8-173/+261
Signed-off-by: Ingo Molnar <[email protected]>
2016-02-29sched/deadline: Remove superfluous call to switched_to_dl()Peter Zijlstra1-2/+1
if (A || B) { } else if (A && !B) { } If A we'll take the first branch, if !A we will not satisfy the second. Therefore the second branch will never be taken. Reported-by: luca abeni <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Juri Lelli <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-02-29sched/debug: Fix preempt_disable_ip recording for preempt_disable()Sebastian Andrzej Siewior2-14/+4
The preempt_disable() invokes preempt_count_add() which saves the caller in ->preempt_disable_ip. It uses CALLER_ADDR1 which does not look for its caller but for the parent of the caller. Which means we get the correct caller for something like spin_lock() unless the architectures inlines those invocations. It is always wrong for preempt_disable() or local_bh_disable(). This patch makes the function get_lock_parent_ip() which tries CALLER_ADDR0,1,2 if the former is a locking function. This seems to record the preempt_disable() caller properly for preempt_disable() itself as well as for get_cpu_var() or local_bh_disable(). Steven asked for the get_parent_ip() -> get_lock_parent_ip() rename. Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-02-29sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularityRik van Riel1-16/+23
When profiling syscall overhead on nohz-full kernels, after removing __acct_update_integrals() from the profile, native_sched_clock() remains as the top CPU user. This can be reduced by moving VIRT_CPU_ACCOUNTING_GEN to jiffy granularity. This will reduce timing accuracy on nohz_full CPUs to jiffy based sampling, just like on normal CPUs. It results in totally removing native_sched_clock from the profile, and significantly speeding up the syscall entry and exit path, as well as irq entry and exit, and KVM guest entry & exit. Additionally, only call the more expensive functions (and advance the seqlock) when jiffies actually changed. This code relies on another CPU advancing jiffies when the system is busy. On a nohz_full system, this is done by a housekeeping CPU. A microbenchmark calling an invalid syscall number 10 million times in a row speeds up an additional 30% over the numbers with just the previous patches, for a total speedup of about 40% over 4.4 and 4.5-rc1. Run times for the microbenchmark: 4.4 3.8 seconds 4.5-rc1 3.7 seconds 4.5-rc1 + first patch 3.3 seconds 4.5-rc1 + first 3 patches 3.1 seconds 4.5-rc1 + all patches 2.3 seconds A non-NOHZ_FULL cpu (not the housekeeping CPU): all kernels 1.86 seconds Signed-off-by: Rik van Riel <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-02-29time, acct: Drop irq save & restore from __acct_update_integrals()Rik van Riel1-5/+4
It looks like all the call paths that lead to __acct_update_integrals() already have irqs disabled, and __acct_update_integrals() does not need to disable irqs itself. This is very convenient since about half the CPU time left in this function was spent in local_irq_save alone. Performance of a microbenchmark that calls an invalid syscall ten million times in a row on a nohz_full CPU improves 21% vs. 4.5-rc1 with both the removal of divisions from __acct_update_integrals() and this patch, with runtime dropping from 3.7 to 2.9 seconds. With these patches applied, the highest remaining cpu user in the trace is native_sched_clock, which is addressed in the next patch. For testing purposes I stuck a WARN_ON(!irqs_disabled()) test in __acct_update_integrals(). It did not trigger. Suggested-by: Peter Zijlstra <[email protected]> Signed-off-by: Rik van Riel <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-02-29acct, time: Change indentation in __acct_update_integrals()Rik van Riel1-25/+26
Change the indentation in __acct_update_integrals() to make the function a little easier to read. Suggested-by: Peter Zijlstra <[email protected]> Signed-off-by: Rik van Riel <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Frederic Weisbecker <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-02-29sched, time: Remove non-power-of-two divides from __acct_update_integrals()Rik van Riel1-10/+16
When running a microbenchmark calling an invalid syscall number in a loop, on a nohz_full CPU, we spend a full 9% of our CPU time in __acct_update_integrals(). This function converts cputime_t to jiffies, to a timeval, only to convert the timeval back to microseconds before discarding it. This patch leaves __acct_update_integrals() functionally equivalent, but speeds things up by about 12%, with 10 million calls to an invalid syscall number dropping from 3.7 to 3.25 seconds. Signed-off-by: Rik van Riel <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-02-29sched/rt: Kick RT bandwidth timer immediately on start upSteven Rostedt1-1/+9
I've been debugging why deadline tasks can cause the RT scheduler to throttle, even when the deadline tasks are only taking up 50% of the CPU and RT tasks are not even using 1% of the CPU. Here's what I found. In order to keep a CPU from being hogged by RT tasks, the deadline scheduler adds its run time (delta_exec) to the rt_time of the RT bandwidth. That way, if the two use more than 95% of the CPU within one second (default settings), the RT tasks are throttled to allow non RT tasks to run. Although the deadline tasks add their run time to the RT bandwidth, it lets the RT tasks do the accounting. This is where the problem lies. If a deadline task runs for a bit, and no RT tasks are running, then it will continually add to the RT rt_time that is used to calculate how much CPU the RT tasks use. But no RT period is in play, and this accumulation of the runtime never gets reset. When an RT task finally gets to run, and the watchdog goes off, it can see that the RT task has used more than it should of, because the deadline task added all this runtime to its rt_time. Then the RT task that just woke up gets throttled for no good reason. I also noticed that when an RT task is queued, it starts the timer to account for overload and such. But that timer goes off one period later, which may be too late and the extra rt_time will trigger a throttle. This is a quick work around to the problem. When a new RT task is queued, the bandwidth timer is set to go off immediately. Then the timer can clear out the extra time added to the rt_time while there was no RT task running. This stops my tests from triggering the throttle, and it will still throttle if an RT task runs too much, even while a deadline task is running. A better solution may be to subtract the bandwidth that the deadline task uses from the rt_runtime, and add it back when its finished. Then there wont be a need for runtime tracking of the time used by deadline tasks. I may play with that solution tomorrow. Signed-off-by: Steven Rostedt <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: Clark Williams Cc: Daniel Bristot de Oliveira <[email protected]> Cc: John Kacur <[email protected]> Cc: Juri Lelli Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-02-29sched/debug: Add deadline scheduler bandwidth ratio to /proc/sched_debugSteven Rostedt (Red Hat)1-0/+9
Playing with SCHED_DEADLINE and cpusets, I found that I was unable to create new SCHED_DEADLINE tasks, with the error of EBUSY as if the bandwidth was already used up. I then realized there wa no way to see what bandwidth is used by the runqueues to debug the issue. By adding the dl_bw->bw and dl_bw->total_bw to the output of the deadline info in /proc/sched_debug, this allows us to see what bandwidth has been reserved and where a problem may exist. For example, before the issue we see the ratio of the bandwidth: # cat /proc/sys/kernel/sched_rt_runtime_us 950000 # cat /proc/sys/kernel/sched_rt_period_us 1000000 # grep dl /proc/sched_debug dl_rq[0]: .dl_nr_running : 0 .dl_bw->bw : 996147 .dl_bw->total_bw : 0 dl_rq[1]: .dl_nr_running : 0 .dl_bw->bw : 996147 .dl_bw->total_bw : 0 dl_rq[2]: .dl_nr_running : 0 .dl_bw->bw : 996147 .dl_bw->total_bw : 0 dl_rq[3]: .dl_nr_running : 0 .dl_bw->bw : 996147 .dl_bw->total_bw : 0 dl_rq[4]: .dl_nr_running : 0 .dl_bw->bw : 996147 .dl_bw->total_bw : 0 dl_rq[5]: .dl_nr_running : 0 .dl_bw->bw : 996147 .dl_bw->total_bw : 0 dl_rq[6]: .dl_nr_running : 0 .dl_bw->bw : 996147 .dl_bw->total_bw : 0 dl_rq[7]: .dl_nr_running : 0 .dl_bw->bw : 996147 .dl_bw->total_bw : 0 Note: (950000 / 1000000) << 20 == 996147 After I played with cpusets and hit the issue, the result is now: # grep dl /proc/sched_debug dl_rq[0]: .dl_nr_running : 0 .dl_bw->bw : 996147 .dl_bw->total_bw : -104857 dl_rq[1]: .dl_nr_running : 0 .dl_bw->bw : 996147 .dl_bw->total_bw : 104857 dl_rq[2]: .dl_nr_running : 0 .dl_bw->bw : 996147 .dl_bw->total_bw : 104857 dl_rq[3]: .dl_nr_running : 0 .dl_bw->bw : 996147 .dl_bw->total_bw : 104857 dl_rq[4]: .dl_nr_running : 0 .dl_bw->bw : 996147 .dl_bw->total_bw : -104857 dl_rq[5]: .dl_nr_running : 0 .dl_bw->bw : 996147 .dl_bw->total_bw : -104857 dl_rq[6]: .dl_nr_running : 0 .dl_bw->bw : 996147 .dl_bw->total_bw : -104857 dl_rq[7]: .dl_nr_running : 0 .dl_bw->bw : 996147 .dl_bw->total_bw : -104857 This shows that there is definitely a problem as we should never have a negative total bandwidth. Signed-off-by: Steven Rostedt <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Clark Williams <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-02-29sched/debug: Move sched_domain_sysctl to debug.cSteven Rostedt (Red Hat)3-178/+186
The sched_domain_sysctl setup is only enabled when SCHED_DEBUG is configured. As debug.c is only compiled when SCHED_DEBUG is configured as well, move the setup of sched_domain_sysctl into that file. Note, the (un)register_sched_domain_sysctl() functions had to be changed from static to allow access to them from core.c. Signed-off-by: Steven Rostedt <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Clark Williams <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-02-29sched/debug: Move the /sys/kernel/debug/sched_features file setup into debug.cSteven Rostedt (Red Hat)2-133/+131
As /sys/kernel/debug/sched_features is only created when SCHED_DEBUG is enabled, and the file debug.c is only compiled when SCHED_DEBUG is enabled, it makes sense to move sched_feature setup into that file and get rid of the #ifdef. Signed-off-by: Steven Rostedt <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Clark Williams <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-02-29sched/rt: Fix PI handling vs. sched_setscheduler()Peter Zijlstra3-50/+111
Andrea Parri reported: > I found that the following scenario (with CONFIG_RT_GROUP_SCHED=y) is not > handled correctly: > > T1 (prio = 20) > lock(rtmutex); > > T2 (prio = 20) > blocks on rtmutex (rt_nr_boosted = 0 on T1's rq) > > T1 (prio = 20) > sys_set_scheduler(prio = 0) > [new_effective_prio == oldprio] > T1 prio = 20 (rt_nr_boosted = 0 on T1's rq) > > The last step is incorrect as T1 is now boosted (c.f., rt_se_boosted()); > in particular, if we continue with > > T1 (prio = 20) > unlock(rtmutex) > wakeup(T2) > adjust_prio(T1) > [prio != rt_mutex_getprio(T1)] > dequeue(T1) > rt_nr_boosted = (unsigned long)(-1) > ... > T1 prio = 0 > > then we end up leaving rt_nr_boosted in an "inconsistent" state. > > The simple program attached could reproduce the previous scenario; note > that, as a consequence of the presence of this state, the "assertion" > > WARN_ON(!rt_nr_running && rt_nr_boosted) > > from dec_rt_group() may trigger. So normally we dequeue/enqueue tasks in sched_setscheduler(), which would ensure the accounting stays correct. However in the early PI path we fail to do so. So this was introduced at around v3.14, by: c365c292d059 ("sched: Consider pi boosting in setscheduler()") which fixed another problem exactly because that dequeue/enqueue, joy. Fix this by teaching rt about DEQUEUE_SAVE/ENQUEUE_RESTORE and have it preserve runqueue location with that option. This requires decoupling the on_rt_rq() state from being on the list. In order to allow for explicit movement during the SAVE/RESTORE, introduce {DE,EN}QUEUE_MOVE. We still must use SAVE/RESTORE in these cases to preserve other invariants. Respecting the SAVE/RESTORE flags also has the (nice) side-effect that things like sys_nice()/sys_sched_setaffinity() also do not reorder FIFO tasks (whereas they used to before this patch). Reported-by: Andrea Parri <[email protected]> Tested-by: Andrea Parri <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2016-02-29sched/core: Remove duplicated sched_group_set_shares() prototypeDongsheng Yang1-1/+0
Signed-off-by: Dongsheng Yang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>