path: root/kernel
2014-01-13  sched: Pass 'struct rq' to on_null_domain()  (Daniel Lezcano; 1 file, -4/+4)
The on_null_domain() function is getting the cpu to retrieve the struct rq associated with it. Pass 'struct rq' directly to the function as the caller already has the info. Signed-off-by: Daniel Lezcano <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-01-13  sched: Reduce nohz_kick_needed() parameters  (Daniel Lezcano; 1 file, -4/+4)
The cpu information is already stored in the struct rq, so there is no need to pass it as a parameter to the nohz_kick_needed() function. The caller of this function has just called idle_cpu() beforehand to fill the rq->idle_balance field. Use rq->cpu and rq->idle_balance. Signed-off-by: Daniel Lezcano <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-01-13  sched: Reduce trigger_load_balance() parameters  (Daniel Lezcano; 3 files, -3/+5)
The cpu information is already stored in the struct rq, so no need to pass it as parameter to the trigger_load_balance function. Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Signed-off-by: Daniel Lezcano <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-01-13  sched/deadline: Fix hotplug admission control  (Peter Zijlstra; 1 file, -51/+32)
The current hotplug admission control is broken because: CPU_DYING -> migration_call() -> migrate_tasks() -> __migrate_task() cannot fail and hard-assumes it _will_ move all tasks off of the dying cpu; failing this will break hotplug. The much simpler solution is a DOWN_PREPARE handler that fails when removing one CPU gets us below the total allocated bandwidth. Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-01-13  sched/deadline: Remove the sysctl_sched_dl knobs  (Peter Zijlstra; 4 files, -221/+97)
Remove the deadline specific sysctls for now. The problem with them is that the interaction with the existing rt knobs is nearly impossible to get right. The current (as per before this patch) situation is that the rt and dl bandwidth is completely separate and we enforce rt+dl < 100%. This is undesirable because this means that the rt default of 95% leaves us hardly any room, even though dl tasks are safer than rt tasks. Another proposed solution (a discarded patch) was to have the dl bandwidth be a fraction of the rt bandwidth. This is highly confusing IMO. Furthermore neither proposal is consistent with the situation we actually want; which is rt tasks run from a dl server. In which case the rt bandwidth is a direct subset of dl. So whichever way we go, the introduction of dl controls at this point is painful. Therefore remove them and instead share the rt budget. This means that for now the rt knobs are used for dl admission control and the dl runtime is accounted against the rt runtime. I realise that this isn't entirely desirable either; but whatever we do we appear to need to change the interface later, so better to have a small interface for now. Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-01-13  sched/deadline: Fix up the smp-affinity mask tests  (Peter Zijlstra; 1 file, -19/+9)
For now deadline tasks are not allowed to set smp affinity; however the current tests are wrong, so cure this. The test in __sched_setscheduler() also uses an on-stack cpumask_t which is a no-no. Change both tests to use cpumask_subset() such that we test the root domain span to be a subset of the cpus_allowed mask. This way we're sure the tasks can always run on all CPUs they can be balanced over, and have no effective affinity constraints. A sketch of the resulting check is shown below. Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
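As an illustration only (not the actual hunk), the kind of test described above looks roughly like this, assuming the task_struct of this era still embeds cpus_allowed as a cpumask_t:

    /* sketch: refuse SCHED_DEADLINE if the task has an effective affinity constraint */
    rcu_read_lock();
    if (!cpumask_subset(task_rq(p)->rd->span, &p->cpus_allowed)) {
            rcu_read_unlock();
            return -EPERM;
    }
    rcu_read_unlock();

The check only passes when the root-domain span is contained in the allowed mask, i.e. when the affinity mask does not actually constrain where the task can be balanced.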
2014-01-13  sched/deadline: speed up SCHED_DEADLINE pushes with a push-heap  (Juri Lelli; 6 files, -40/+269)
Data from tests confirmed that the original active load balancing logic did not scale in either the number of CPUs or the number of tasks (as sched_rt does). Here we provide a global data structure to keep track of the deadlines of the running tasks in the system. The structure is composed of a bitmask showing the free CPUs and a max-heap, needed when the system is heavily loaded. The implementation and concurrent access scheme are kept simple by design. However, our measurements show that we can compete with sched_rt on large multi-CPU machines [1]. Only the push path is addressed; the extension to use this structure also for pull decisions is straightforward. However, we are currently evaluating different data structures (in order to decrease/avoid contention) to possibly solve both problems. We are also going to re-run tests considering recent changes inside cpupri [2]. [1] http://retis.sssup.it/~jlelli/papers/Ospert11Lelli.pdf [2] http://www.spinics.net/lists/linux-rt-users/msg06778.html Signed-off-by: Juri Lelli <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-01-13  sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks  (Dario Faggioli; 4 files, -36/+541)
In order for deadline scheduling to be effective and useful, it is important to have some method of keeping the allocation of the available CPU bandwidth to tasks and task groups under control. This is usually called "admission control" and if it is not performed at all, no guarantee can be given on the actual scheduling of the -deadline tasks. Since RT-throttling was introduced, each task group has had a bandwidth associated with it, calculated as a certain amount of runtime over a period. Moreover, to make it possible to manipulate such bandwidth, readable/writable controls have been added to both procfs (for system wide settings) and cgroupfs (for per-group settings). Therefore, the same interface is being used for controlling the bandwidth distribution to -deadline tasks and task groups, i.e., new controls but with similar names, equivalent meaning and the same usage paradigm are added. However, more discussion is needed in order to figure out how we want to manage SCHED_DEADLINE bandwidth at the task group level. Therefore, this patch adds a less sophisticated, but actually very sensible, mechanism to ensure that a certain utilization cap is not exceeded in each root_domain (the single rq for !SMP configurations). Another main difference between deadline bandwidth management and RT-throttling is that -deadline tasks have bandwidth on their own (while -rt ones don't!), and thus we don't need a higher-level throttling mechanism to enforce the desired bandwidth. This patch, therefore:
 - adds system wide deadline bandwidth management by means of:
    * /proc/sys/kernel/sched_dl_runtime_us,
    * /proc/sys/kernel/sched_dl_period_us,
   that determine (i.e., runtime / period) the total bandwidth available on each CPU of each root_domain for -deadline tasks;
 - couples the RT and deadline bandwidth management, i.e., enforces that the sum of the bandwidth devoted to -rt and -deadline tasks stays below 100%.
This means that, for a root_domain comprising M CPUs, -deadline tasks can be created until the sum of their bandwidths stays below: M * (sched_dl_runtime_us / sched_dl_period_us) It is also possible to disable this bandwidth management logic, and thus be free to oversubscribe the system up to any arbitrary level. Signed-off-by: Dario Faggioli <[email protected]> Signed-off-by: Juri Lelli <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
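To make the per-root_domain cap concrete, here is a small user-space sketch of the arithmetic behind the admission test (the sysctl values and task parameters are examples only; the kernel itself does this with scaled integer arithmetic, not floating point):

    #include <stdio.h>

    /* bandwidth of one -deadline task: runtime / period (both in microseconds) */
    static double dl_bw(double runtime_us, double period_us)
    {
            return runtime_us / period_us;
    }

    int main(void)
    {
            int M = 4;                       /* CPUs in the root_domain           */
            double sys_runtime = 950000.0;   /* example sched_dl_runtime_us value */
            double sys_period  = 1000000.0;  /* example sched_dl_period_us value  */
            double cap = M * (sys_runtime / sys_period);   /* 3.8 "CPUs worth"    */

            /* three hypothetical -deadline tasks */
            double used = dl_bw(10000, 100000)     /* 0.10 */
                        + dl_bw(30000, 50000)      /* 0.60 */
                        + dl_bw(250000, 500000);   /* 0.50 */

            printf("cap=%.2f used=%.2f -> %s\n", cap, used,
                   used <= cap ? "admitted" : "rejected");
            return 0;
    }

With these example numbers the three tasks use 1.2 CPUs worth of bandwidth against a cap of 3.8, so a fourth task would still be admitted as long as the sum stays below the cap.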
2014-01-13  sched/deadline: Add SCHED_DEADLINE inheritance logic  (Dario Faggioli; 7 files, -53/+122)
Some method is needed to deal with rt-mutexes and make sched_dl interact with the current PI code; this raises issues that are far from trivial and that (in our view) need to be solved with some restructuring of the pi-code (i.e., going toward a proxy-execution-ish implementation). This is under development; in the meanwhile, as a temporary solution, what this commit does is:
 - ensure a pi-lock owner with waiters is never throttled down. Instead, when it runs out of runtime, it immediately gets replenished and its deadline is postponed;
 - the scheduling parameters (relative deadline and default runtime) used for those replenishments --during the whole period it holds the pi-lock-- are the ones of the waiting task with the earliest deadline.
Acting this way, we provide some kind of boosting to the lock-owner, still by using the existing (actually, slightly modified by the previous commit) pi-architecture. We would stress the fact that this is only a surely needed, but far from clean, solution to the problem. In the end it's only a way to re-start discussion within the community. So, as always, comments, ideas, rants, etc.. are welcome! :-) Signed-off-by: Dario Faggioli <[email protected]> Signed-off-by: Juri Lelli <[email protected]> [ Added !RT_MUTEXES build fix. ] Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-01-13  rtmutex: Turn the plist into an rb-tree  (Peter Zijlstra; 6 files, -52/+138)
Turn the pi-chains from plist to rb-tree, in the rt_mutex code, and provide a proper comparison function for -deadline and -priority tasks. This is done mainly because:
 - the classical prio field of the plist is just an int, which might not be enough for representing a deadline;
 - manipulating such a list would become O(nr_deadline_tasks), which might be too much, as the number of -deadline tasks increases.
Therefore, an rb-tree is used, and tasks are queued in it according to the following logic:
 - among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the one with the higher (lower, actually!) prio wins;
 - among a -priority and a -deadline task, the latter always wins;
 - among two -deadline tasks, the one with the earliest deadline wins.
Queueing and dequeueing functions are changed accordingly, for both the list of a task's pi-waiters and the list of tasks blocked on a pi-lock. Signed-off-by: Peter Zijlstra <[email protected]> Signed-off-by: Dario Faggioli <[email protected]> Signed-off-by: Juri Lelli <[email protected]> Signed-off-again-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
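The ordering rule above, written out as a stand-alone sketch (this is not the actual rt_mutex comparator; the struct, names, and the convention that a lower prio value means higher priority are assumptions for illustration):

    struct waiter_key {
            int is_dl;                      /* 1 for a -deadline task                      */
            int prio;                       /* -priority tasks: lower value = higher prio  */
            unsigned long long deadline;    /* -deadline tasks: absolute deadline          */
    };

    /* return nonzero if 'a' should be queued before 'b' in the rb-tree */
    static int waiter_less(const struct waiter_key *a, const struct waiter_key *b)
    {
            if (a->is_dl && b->is_dl)
                    return a->deadline < b->deadline;   /* earliest deadline wins    */
            if (a->is_dl != b->is_dl)
                    return a->is_dl;                    /* -deadline beats -priority */
            return a->prio < b->prio;                   /* lower value = higher prio */
    }

Both rb-trees (a task's pi-waiters and the waiters of a lock) are kept ordered by this relation.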
2014-01-13  sched/deadline: Add latency tracing for SCHED_DEADLINE tasks  (Dario Faggioli; 2 files, -18/+79)
It is very likely that systems that want/need to use the new SCHED_DEADLINE policy also want to have the scheduling latency of the -deadline tasks under control. For this reason a new version of the scheduling wakeup latency tracer, called "wakeup_dl", is introduced. As a consequence of applying this patch there will be three wakeup latency tracers:
 * "wakeup", which deals with all tasks in the system;
 * "wakeup_rt", which deals with -rt and -deadline tasks only;
 * "wakeup_dl", which deals with -deadline tasks only.
Signed-off-by: Dario Faggioli <[email protected]> Signed-off-by: Juri Lelli <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-01-13  sched/deadline: Add period support for SCHED_DEADLINE tasks  (Harald Gustafsson; 2 files, -5/+15)
Make it possible to specify a period (different from, or equal to, the deadline) for -deadline tasks. Relative deadlines (D_i) are used on task arrivals to generate new scheduling (absolute) deadlines as "d = t + D_i", and periods (P_i) to postpone the scheduling deadlines as "d = d + P_i" when the budget is zero. This is in general useful for modelling (and scheduling) tasks that have slow activation rates (long periods), but have to be scheduled soon once activated (short deadlines). Signed-off-by: Harald Gustafsson <[email protected]> Signed-off-by: Dario Faggioli <[email protected]> Signed-off-by: Juri Lelli <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
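A worked example (the numbers are illustrative only): a task with runtime 10ms, relative deadline D_i = 20ms and period P_i = 100ms that arrives at t = 0 gets the absolute deadline d = 0 + 20ms; when it exhausts its 10ms budget, the budget is replenished and the deadline is postponed to d = 20ms + 100ms = 120ms. Its long-term bandwidth is therefore runtime/period = 10%, even though each activation has to complete within the much shorter 20ms window.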
2014-01-13  sched/deadline: Add SCHED_DEADLINE avg_update accounting  (Dario Faggioli; 1 file, -0/+2)
Make the core scheduler and load balancer aware of the load produced by -deadline tasks, by updating the moving average like for sched_rt. Signed-off-by: Dario Faggioli <[email protected]> Signed-off-by: Juri Lelli <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-01-13  sched/deadline: Add SCHED_DEADLINE SMP-related data structures & logic  (Juri Lelli; 4 files, -17/+962)
Introduces the data structures relevant for implementing dynamic migration of -deadline tasks and the logic for checking if runqueues are overloaded with -deadline tasks and for choosing where a task should migrate, when that is the case. It also adds dynamic migrations to SCHED_DEADLINE, so that tasks can be moved among CPUs when necessary. It is also possible to bind a task to a (set of) CPU(s), thus restricting its ability to migrate, or forbidding migrations altogether. The very same approach used in sched_rt is utilised:
 - -deadline tasks are kept in CPU-specific runqueues,
 - -deadline tasks are migrated among runqueues to achieve the following:
   * on an M-CPU system the M earliest deadline ready tasks are always running;
   * the affinity/cpusets settings of all the -deadline tasks are always respected.
Therefore, this very special form of "load balancing" is done with an active method, i.e., the scheduler pushes or pulls tasks between runqueues when they are woken up and/or (de)scheduled. IOW, every time a preemption occurs, the descheduled task might be sent to some other CPU (depending on its deadline) to continue executing (push). On the other hand, every time a CPU becomes idle, it might pull the second earliest deadline ready task from some other CPU. To enforce this, a pull operation is always attempted before taking any scheduling decision (pre_schedule()), as well as a push one after each scheduling decision (post_schedule()). In addition, when a task arrives or wakes up, the best CPU on which to resume it is selected taking into account its affinity mask, the system topology, and also its deadline. E.g., from the scheduling point of view, the best CPU on which to wake up (and also to which to push) a task is the one running the task with the latest deadline among the M executing ones. In order to facilitate these decisions, per-runqueue "caching" of the deadlines of the currently running task and of the first ready task is used. Queued but not running tasks are also parked in another rb-tree to speed up pushes. Signed-off-by: Juri Lelli <[email protected]> Signed-off-by: Dario Faggioli <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-01-13  sched/deadline: Add SCHED_DEADLINE structures & implementation  (Dario Faggioli; 7 files, -19/+812)
Introduces the data structures, constants and symbols needed for the SCHED_DEADLINE implementation. The core data structures of SCHED_DEADLINE are defined, along with their initializers. Hooks for checking if a task belongs to the new policy are also added where they are needed. Adds a scheduling class, in sched/dl.c, and a new policy called SCHED_DEADLINE. It is an implementation of the Earliest Deadline First (EDF) scheduling algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS) that makes it possible to isolate the behaviour of tasks from one another. The typical -deadline task will be made up of a computation phase (instance) which is activated in a periodic or sporadic fashion. The expected (maximum) duration of such a computation is called the task's runtime; the time interval by which each instance needs to be completed is called the task's relative deadline. The task's absolute deadline is dynamically calculated as the time instant a task (better, an instance) activates plus the relative deadline. The EDF algorithm selects the task with the smallest absolute deadline as the one to be executed first, while the CBS ensures that each task runs for at most its runtime every (relative) deadline-length time interval, avoiding any interference between different tasks (bandwidth isolation). Thanks to this feature, tasks that do not strictly comply with the computational model sketched above can also effectively use the new policy. To summarize, this patch:
 - introduces the data structures, constants and symbols needed;
 - implements the core logic of the scheduling algorithm in the new scheduling class file;
 - provides all the glue code between the new scheduling class and the core scheduler and refines the interactions between sched/dl and the other existing scheduling classes.
Signed-off-by: Dario Faggioli <[email protected]> Signed-off-by: Michael Trimarchi <[email protected]> Signed-off-by: Fabio Checconi <[email protected]> Signed-off-by: Juri Lelli <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-01-13  sched: Add new scheduler syscalls to support an extended scheduling parameters ABI  (Dario Faggioli; 2 files, -23/+249)
Add the syscalls needed for supporting scheduling algorithms with extended scheduling parameters (e.g., SCHED_DEADLINE). In general, this makes it possible to specify a periodic/sporadic task that executes for a given amount of runtime at each instance, and is scheduled according to the urgency of its own timing constraints, i.e.:
 - a (maximum/typical) instance execution time,
 - a minimum interval between consecutive instances,
 - a time constraint by which each instance must be completed.
Thus, both the data structure that holds the scheduling parameters of the tasks and the system calls dealing with it must be extended. Unfortunately, modifying the existing struct sched_param would break the ABI and result in potentially serious compatibility issues with legacy binaries. For these reasons, this patch:
 - defines the new struct sched_attr, containing all the fields that are necessary for specifying a task in the computational model described above;
 - defines and implements the new scheduling related syscalls that manipulate it, i.e., sched_setattr() and sched_getattr().
Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a proof of concept and for developing and testing purposes. Making them available on other architectures is straightforward. Since no "user" for these new parameters is introduced in this patch, the implementation of the new system calls is just identical to their already existing counterparts. Future patches that implement scheduling policies able to exploit the new data structure must also take care of modifying the sched_*attr() calls according to their own purposes. Signed-off-by: Dario Faggioli <[email protected]> [ Rewrote to use sched_attr. ] Signed-off-by: Juri Lelli <[email protected]> [ Removed sched_setscheduler2() for now. ] Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
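Since there is no glibc wrapper for the new calls, user space goes through syscall(). A minimal sketch of setting SCHED_DEADLINE parameters on the calling thread (the field layout follows the description above; the syscall number 314 and the SCHED_DEADLINE policy value 6 are assumptions for x86-64 and should really be taken from the installed kernel headers):

    #define _GNU_SOURCE
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    struct sched_attr {
            uint32_t size;              /* size of this structure, for ABI extensibility  */
            uint32_t sched_policy;
            uint64_t sched_flags;
            int32_t  sched_nice;        /* used by SCHED_NORMAL/SCHED_BATCH               */
            uint32_t sched_priority;    /* used by SCHED_FIFO/SCHED_RR                    */
            uint64_t sched_runtime;     /* remaining three are for SCHED_DEADLINE, in ns  */
            uint64_t sched_deadline;
            uint64_t sched_period;
    };

    int main(void)
    {
            struct sched_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);
            attr.sched_policy   = 6;                    /* SCHED_DEADLINE (assumed value) */
            attr.sched_runtime  = 10 * 1000 * 1000;     /*  10 ms per instance            */
            attr.sched_deadline = 30 * 1000 * 1000;     /*  30 ms relative deadline       */
            attr.sched_period   = 100 * 1000 * 1000;    /* 100 ms period                  */

            if (syscall(314 /* assumed __NR_sched_setattr on x86-64 */, 0, &attr, 0)) {
                    perror("sched_setattr");
                    return 1;
            }
            return 0;
    }

Note that this particular patch only wires up the syscalls and introduces no consumer of the extended fields yet; the SCHED_DEADLINE policy itself comes from the sched/dl patches above.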
2014-01-13  Merge branch 'sched/urgent' into sched/core  (Ingo Molnar; 17 files, -60/+110)
Pick up the latest fixes before applying new changes. Signed-off-by: Ingo Molnar <[email protected]>
2014-01-13  futexes: Avoid taking the hb->lock if there's nothing to wake up  (Davidlohr Bueso; 1 file, -25/+92)
In futex_wake() there is clearly no point in taking the hb->lock if we know beforehand that there are no tasks to be woken. While the hash bucket's plist head is a cheap way of knowing this, we cannot rely 100% on it as there is a racy window between the futex_wait call and when the task is actually added to the plist. To this end, we couple it with the spinlock check, as tasks trying to enter the critical region are most likely potential waiters that will be added to the plist, thus preventing tasks from sleeping forever if wakers don't acknowledge all possible waiters. Furthermore, the futex ordering guarantees are preserved, ensuring that waiters either observe the changed user space value before blocking or are woken by a concurrent waker. For wakers, this is done by relying on the barriers in get_futex_key_refs() -- for archs that do not have an implicit mb in atomic_inc(), we explicitly add them through a new futex_get_mm function. For waiters we rely on the fact that spin_lock calls already update the head counter, so spinners are visible even if the lock hasn't been acquired yet. For more details please refer to the updated comments in the code and related discussion: https://lkml.org/lkml/2013/11/26/556 Special thanks to tglx for careful review and feedback. Suggested-by: Linus Torvalds <[email protected]> Reviewed-by: Darren Hart <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Reviewed-by: Peter Zijlstra <[email protected]> Signed-off-by: Davidlohr Bueso <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Jeff Mahoney <[email protected]> Cc: Scott Norton <[email protected]> Cc: Tom Vaden <[email protected]> Cc: Aswin Chandramouleeswaran <[email protected]> Cc: Waiman Long <[email protected]> Cc: Jason Low <[email protected]> Cc: Andrew Morton <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
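The shape of the resulting fast path, as a rough sketch only (not the exact code from the patch; the helper name and body here just illustrate the coupled plist + spinlock check described above):

    /* sketch: can futex_wake() skip hb->lock entirely? */
    static inline int hb_waiters_pending(struct futex_hash_bucket *hb)
    {
            /*
             * A non-empty plist means there are queued waiters; a held lock
             * means somebody may be about to queue themselves.
             */
            return !plist_head_empty(&hb->chain) || spin_is_locked(&hb->lock);
    }

    ...
    if (!hb_waiters_pending(hb))
            goto out_put_key;       /* nothing to wake, don't touch hb->lock */

combined with the get_futex_key_refs()/futex_get_mm() barriers mentioned above so that wakers and waiters still agree on ordering.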
2014-01-13  futexes: Document multiprocessor ordering guarantees  (Thomas Gleixner; 1 file, -0/+57)
That's essential, if you want to hack on futexes. Reviewed-by: Darren Hart <[email protected]> Reviewed-by: Peter Zijlstra <[email protected]> Reviewed-by: Paul E. McKenney <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Davidlohr Bueso <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Jeff Mahoney <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Scott Norton <[email protected]> Cc: Tom Vaden <[email protected]> Cc: Aswin Chandramouleeswaran <[email protected]> Cc: Waiman Long <[email protected]> Cc: Jason Low <[email protected]> Cc: Andrew Morton <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-01-13  futexes: Increase hash table size for better performance  (Davidlohr Bueso; 1 file, -7/+19)
Currently, the futex global hash table suffers from its fixed, smallish (for today's standards) size of 256 entries, as well as its lack of NUMA awareness. Large systems, using many futexes, can be prone to high amounts of collisions; where these futexes hash to the same bucket and lead to extra contention on the same hb->lock. Furthermore, cacheline bouncing is a reality when we have multiple hb->locks residing on the same cacheline and different futexes hash to adjacent buckets. This patch keeps the current static size of 16 entries for small systems, or otherwise, 256 * ncpus (or larger, as we need to round the number up to a power of 2). Note that this number of CPUs accounts for all CPUs that can ever be available in the system, taking into consideration things like hotplugging. While we do impose extra overhead at bootup by making the hash table larger, this is a one-time thing, and does not overshadow the benefits of this patch. Furthermore, as suggested by tglx, by cache-aligning the hash buckets we can avoid access across cacheline boundaries and also avoid massive cache line bouncing if multiple cpus are hammering away at different hash buckets which happen to reside in the same cache line. Also, similar to other core kernel components (pid, dcache, tcp), by using alloc_large_system_hash() we benefit from its NUMA awareness and thus the table is distributed among the nodes instead of in a single one. For a custom microbenchmark that pounds on the uaddr hashing -- making the wait path fail at futex_wait_setup() returning -EWOULDBLOCK for large amounts of futexes, we can see the following benefits on an 80-core, 8-socket 1TB server:

    +---------+--------------------+------------------------+-----------------------+-------------------------------+
    | threads | baseline (ops/sec) | aligned-only (ops/sec) | large table (ops/sec) | large table+aligned (ops/sec) |
    +---------+--------------------+------------------------+-----------------------+-------------------------------+
    |     512 |              32426 | 50531  (+55.8%)        | 255274  (+687.2%)     | 292553  (+802.2%)             |
    |     256 |              65360 | 99588  (+52.3%)        | 443563  (+578.6%)     | 508088  (+677.3%)             |
    |     128 |             125635 | 200075 (+59.2%)        | 742613  (+491.1%)     | 835452  (+564.9%)             |
    |      80 |             193559 | 323425 (+67.1%)        | 1028147 (+431.1%)     | 1130304 (+483.9%)             |
    |      64 |             247667 | 443740 (+79.1%)        | 997300  (+302.6%)     | 1145494 (+362.5%)             |
    |      32 |             628412 | 721401 (+14.7%)        | 965996  (+53.7%)      | 1122115 (+78.5%)              |
    +---------+--------------------+------------------------+-----------------------+-------------------------------+

Reviewed-by: Darren Hart <[email protected]> Reviewed-by: Peter Zijlstra <[email protected]> Reviewed-by: Paul E. McKenney <[email protected]> Reviewed-by: Waiman Long <[email protected]> Reviewed-and-tested-by: Jason Low <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Signed-off-by: Davidlohr Bueso <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Jeff Mahoney <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Scott Norton <[email protected]> Cc: Tom Vaden <[email protected]> Cc: Aswin Chandramouleeswaran <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
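The sizing rule spelled out above, as a kernel-context sketch (illustrative only; the variable name is made up and the real work is done at futex init time):

    /* sketch: boot-time sizing of the futex hash table */
    #if CONFIG_BASE_SMALL
            futex_hashsize = 16;                    /* keep tiny systems tiny */
    #else
            futex_hashsize = roundup_pow_of_two(256 * num_possible_cpus());
    #endif
            /* the table itself comes from alloc_large_system_hash(), which
             * spreads the allocation across NUMA nodes; each bucket is
             * cacheline aligned to avoid the bouncing described above */

On the 80-core box used for the numbers above this gives 256 * 80 = 20480, rounded up to 32768 buckets, versus the old global 256.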
2014-01-13  futexes: Clean up various details  (Jason Low; 1 file, -27/+12)
- Remove unnecessary head variables. - Delete unused parameter in queue_unlock(). Reviewed-by: Darren Hart <[email protected]> Reviewed-by: Peter Zijlstra <[email protected]> Reviewed-by: Paul E. McKenney <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Signed-off-by: Jason Low <[email protected]> Signed-off-by: Davidlohr Bueso <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Jeff Mahoney <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Scott Norton <[email protected]> Cc: Tom Vaden <[email protected]> Cc: Aswin Chandramouleeswaran <[email protected]> Cc: Waiman Long <[email protected]> Cc: Andrew Morton <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-01-13  Merge tag 'v3.13-rc8' into core/locking  (Ingo Molnar; 14 files, -112/+158)
Refresh the tree with the latest fixes, before applying new changes. Signed-off-by: Ingo Molnar <[email protected]>
2014-01-12  Merge branch 'fortglx/3.14/time' of git://git.linaro.org/people/john.stultz/linux into timers/core  (Ingo Molnar; 4 files, -27/+29)
Pull timekeeping updates from John Stultz. Signed-off-by: Ingo Molnar <[email protected]>
2014-01-12  Merge branch 'linus' into timers/core  (Ingo Molnar; 26 files, -170/+226)
Pick up the latest fixes and refresh the branch. Signed-off-by: Ingo Molnar <[email protected]>
2014-01-12  perf: Introduce a flag to enable close-on-exec in perf_event_open()  (Yann Droneaud; 1 file, -3/+9)
Unlike recent modern userspace APIs such as: epoll_create1 (EPOLL_CLOEXEC), eventfd (EFD_CLOEXEC), fanotify_init (FAN_CLOEXEC), inotify_init1 (IN_CLOEXEC), signalfd (SFD_CLOEXEC), timerfd_create (TFD_CLOEXEC), or the venerable general purpose open (O_CLOEXEC), the perf_event_open() syscall lacks a flag to atomically set the FD_CLOEXEC (i.e. close-on-exec) flag on the file descriptor it returns to userspace. The present patch adds a PERF_FLAG_FD_CLOEXEC flag to allow the perf_event_open() syscall to atomically set close-on-exec. Having this flag will enable userspace to remove the file descriptor from the list of file descriptors being inherited across exec, without the need to call fcntl(fd, F_SETFD, FD_CLOEXEC) and without the associated race condition between the current thread and another thread calling fork(2) then execve(2). Links:
 - Secure File Descriptor Handling (Ulrich Drepper, 2008) http://udrepper.livejournal.com/20407.html
 - Excuse me son, but your code is leaking !!! (Dan Walsh, March 2012) http://danwalsh.livejournal.com/53603.html
 - Notes in DMA buffer sharing: leak and security hole http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/dma-buf-sharing.txt?id=v3.13-rc3#n428
Signed-off-by: Yann Droneaud <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Al Viro <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Linus Torvalds <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/8c03f54e1598b1727c19706f3af03f98685d9fe6.1388952061.git.ydroneaud@opteya.com Signed-off-by: Ingo Molnar <[email protected]>
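A user-space sketch of the difference (the PERF_FLAG_FD_CLOEXEC fallback value below is an assumption for systems whose headers predate this patch; real code should take it from the updated linux/perf_event.h):

    #include <linux/perf_event.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef PERF_FLAG_FD_CLOEXEC
    #define PERF_FLAG_FD_CLOEXEC (1UL << 3)   /* assumed value, check your headers */
    #endif

    int main(void)
    {
            struct perf_event_attr attr;
            int fd;

            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);
            attr.type = PERF_TYPE_HARDWARE;
            attr.config = PERF_COUNT_HW_CPU_CYCLES;

            /*
             * Old way: open, then mark close-on-exec in a second step; another
             * thread doing fork()+execve() between the two calls leaks the fd:
             *     fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
             *     fcntl(fd, F_SETFD, FD_CLOEXEC);
             *
             * New way: the flag closes that race.
             */
            fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1,
                         PERF_FLAG_FD_CLOEXEC);
            return fd < 0;
    }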
2014-01-12  perf/x86: Fix active_entry initialization  (Stephane Eranian; 1 file, -0/+2)
This patch fixes a problem with the initialization of the struct perf_event active_entry field. It is defined inside an anonymous union and was initialized in perf_event_alloc() using INIT_LIST_HEAD(). However at that time, we do not know whether the event is going to use active_entry or hlist_entry (SW). Or at least, we don't want to make that determination there. The problem is that hlist and list_head are not initialized the same way. One is okay with NULL (from kzalloc), the other needs its pointers to point to itself. This patch resolves the problem by dropping the union. This will avoid problems later on, if someone starts using active_entry or hlist_entry without verifying that they actually overlap. This also solves the initialization problem. Signed-off-by: Stephane Eranian <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
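The underlying asymmetry, shown as a stand-alone illustration (simplified re-declarations, not the kernel's own list.h):

    struct list_head  { struct list_head *next, *prev; };
    struct hlist_node { struct hlist_node *next, **pprev; };

    /* An empty list_head must point at itself ... */
    void init_list_head(struct list_head *h)  { h->next = h->prev = h; }

    /* ... while an hlist_node counts as "unhashed" when simply zeroed,
     * which is exactly the state kzalloc() already hands back. */
    void init_hlist_node(struct hlist_node *n) { n->next = NULL; n->pprev = NULL; }

So zero-filled memory is a valid initial state for the hlist member but not for the list_head member, which is why initializing one member of the anonymous union as if it were the other is fragile.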
2014-01-12  sched_clock: Disable seqlock lockdep usage in sched_clock()  (John Stultz; 1 file, -3/+3)
Unfortunately the seqlock lockdep enablement can't be used in sched_clock(), since the lockdep infrastructure eventually calls into sched_clock(), which causes a deadlock. Thus, this patch changes all generic sched_clock() usage to use the raw_* methods. Acked-by: Linus Torvalds <[email protected]> Reviewed-by: Stephen Boyd <[email protected]> Reported-by: Krzysztof Hałasa <[email protected]> Signed-off-by: John Stultz <[email protected]> Cc: Uwe Kleine-König <[email protected]> Cc: Willy Tarreau <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-01-12  sched: Calculate effective load even if local weight is 0  (Rik van Riel; 1 file, -1/+1)
Thomas Hellstrom bisected a regression where erratic 3D performance is experienced on virtual machines as measured by glxgears. It identified commit 58d081b5 ("sched/numa: Avoid overloading CPUs on a preferred NUMA node") as the problem which had modified the behaviour of effective_load. Effective load calculates the difference to the system-wide load if a scheduling entity was moved to another CPU. The task group is not heavier as a result of the move but overall system load can increase/decrease as a result of the change. Commit 58d081b5 ("sched/numa: Avoid overloading CPUs on a preferred NUMA node") changed effective_load to make it suitable for calculating if a particular NUMA node was compute overloaded. To reduce the cost of the function, it assumed that a current sched entity weight of 0 was uninteresting but that is not the case. wake_affine() uses a weight of 0 for sync wakeups on the grounds that it is assuming the waking task will sleep and not contribute to load in the near future. In this case, we still want to calculate the effective load of the sched entity hierarchy. As effective_load is no longer used by task_numa_compare since commit fb13c7ee (sched/numa: Use a system-wide search to find swap/migration candidates), this patch simply restores the historical behaviour. Reported-and-tested-by: Thomas Hellstrom <[email protected]> Signed-off-by: Rik van Riel <[email protected]> [ Wrote changelog] Signed-off-by: Mel Gorman <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-01-11  workqueue: Calling destroy_work_on_stack() to pair with INIT_WORK_ONSTACK()  (Chuansheng Liu; 1 file, -0/+2)
When CONFIG_DEBUG_OBJECTS_WORK is defined, destroy_work_on_stack() needs to be called to free the debug object and pair with INIT_WORK_ONSTACK(). Signed-off-by: Liu, Chuansheng <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
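The pairing this fixes, in pattern form (a kernel-context sketch; the function names around the work item are made up):

    #include <linux/workqueue.h>

    static void my_work_fn(struct work_struct *work)
    {
            /* ... runs in workqueue context ... */
    }

    static void run_work_on_stack(void)
    {
            struct work_struct work;

            INIT_WORK_ONSTACK(&work, my_work_fn);  /* registers a debug object        */
            schedule_work(&work);
            flush_work(&work);                     /* must finish before we return    */
            destroy_work_on_stack(&work);          /* frees the debug object when
                                                      CONFIG_DEBUG_OBJECTS_WORK is set */
    }

Without the final call, the debug object tracking the on-stack work item is leaked.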
2014-01-09  ftrace: Synchronize setting function_trace_op with ftrace_trace_function  (Steven Rostedt (Red Hat); 1 file, -15/+72)
ftrace_trace_function is a variable that holds what function will be called directly by the assembly code (mcount). If just a single function is registered and it handles recursion itself, then the assembly will call that function directly without any helper function. It also passes in the ftrace_op that was registered with the callback. The ftrace_op to send is stored in the function_trace_op variable. The ftrace_trace_function and function_trace_op need to be coordinated such that the called callback won't be called with the wrong ftrace_op; otherwise bad things can happen if it expected a different op. Luckily, there's no callback that doesn't use the helper functions that requires this. But there soon will be, and this needs to be fixed. Use a set_function_trace_op to store the ftrace_op to set the function_trace_op to when it is safe to do so (during the update function within the breakpoint or stop machine calls). Or, if dynamic ftrace is not being used (static tracing), then we have to do a bit more synchronization when the ftrace_trace_function is set, as that takes effect immediately (as opposed to dynamic ftrace doing it with the modification of the trampoline). Signed-off-by: Steven Rostedt <[email protected]>
2014-01-09  tracing: Show available event triggers when no trigger is set  (Steven Rostedt (Red Hat); 1 file, -0/+20)
Currently there's no way to know what triggers exist on a kernel without looking at the source of the kernel or randomly trying out triggers. Instead of creating another file in the debugfs system, simply show which triggers are available when cat'ing the trigger file while no trigger is set:

    [root /sys/kernel/debug/tracing]# cat events/sched/sched_switch/trigger
    # Available triggers:
    # traceon traceoff snapshot stacktrace enable_event disable_event

This stays consistent with other debugfs files where meta data like this is always preceded by a '#' at the start of the line so that tools can strip these out. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Steven Rostedt <[email protected]>
2014-01-09  tracing: Consolidate event trigger code  (Steven Rostedt (Red Hat); 2 files, -80/+16)
The event trigger code that checks for callback triggers before and after recording of an event has lots of flag checks. This code is duplicated throughout the ftrace events, kprobes and system calls. They all do the exact same checks against the event flags. Helper functions ftrace_trigger_soft_disabled(), event_trigger_unlock_commit() and event_trigger_unlock_commit_regs() have been added to consolidate the code, and they are used instead. Link: http://lkml.kernel.org/r/[email protected] Acked-by: Tom Zanussi <[email protected]> Tested-by: Tom Zanussi <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2014-01-09  tracing: Fix counter for traceon/off event triggers  (Steven Rostedt (Red Hat); 1 file, -2/+8)
The counters for the traceon and traceoff triggers are only supposed to decrement when the trigger enables or disables tracing. They are not supposed to decrement every time the event is hit. Only decrement the counter if the trigger actually did something. Link: http://lkml.kernel.org/r/[email protected] Acked-by: Tom Zanussi <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2014-01-06  tracing: Remove double-underscore naming in syscall trigger invocations  (Tom Zanussi; 1 file, -8/+8)
There's no reason to use double-underscores for any variable name in ftrace_syscall_enter()/exit(), since those functions aren't generated and there's no need to avoid namespace collisions as with the event macros, which is where the original invocation code came from. Link: http://lkml.kernel.org/r/0b489c9d1f7ee315cff60fa0e4c2b433ade8ae0d.1389036657.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2014-01-06  tracing/kprobes: Add trace event trigger invocations  (Tom Zanussi; 1 file, -6/+36)
Add code to the kprobe/kretprobe event functions that will invoke any event triggers associated with a probe's ftrace_event_file. The code to do this is very similar to the invocation code already used to invoke the triggers associated with static events and essentially replaces the existing soft-disable checks with a superset that preserves the original behavior but adds the bits needed to support event triggers. Link: http://lkml.kernel.org/r/f2d49f157b608070045fdb26c9564d5a05a5a7d0.1389036657.git.tom.zanussi@linux.intel.com Acked-by: Masami Hiramatsu <[email protected]> Signed-off-by: Tom Zanussi <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2014-01-03  tracing/probes: Fix build break on !CONFIG_KPROBE_EVENT  (Namhyung Kim; 2 files, -8/+8)
When the kprobe-based dynamic event tracer is not enabled, it caused the following build error:

    kernel/built-in.o: In function `traceprobe_update_arg':
    (.text+0x10c8dd): undefined reference to `fetch_symbol_u8'
    kernel/built-in.o: In function `traceprobe_update_arg':
    (.text+0x10c8e9): undefined reference to `fetch_symbol_u16'
    kernel/built-in.o: In function `traceprobe_update_arg':
    (.text+0x10c8f5): undefined reference to `fetch_symbol_u32'
    kernel/built-in.o: In function `traceprobe_update_arg':
    (.text+0x10c901): undefined reference to `fetch_symbol_u64'
    kernel/built-in.o: In function `traceprobe_update_arg':
    (.text+0x10c909): undefined reference to `fetch_symbol_string'
    kernel/built-in.o: In function `traceprobe_update_arg':
    (.text+0x10c913): undefined reference to `fetch_symbol_string_size'
    ...

This was because the fetch methods are referenced from the CHECK_FETCH_FUNCS macro but were only defined in trace_kprobe.c. Move the NULL definitions of such fetch functions to the header file. Note that CONFIG_BRANCH_PROFILING also needs to be enabled to trigger this failure. This is because the "fetch_symbol_*" variables are referenced in an "else if" statement that will only call update_symbol_cache(), which is a static inline stub function when CONFIG_KPROBE_EVENT is not enabled. gcc is smart enough to optimize this "else if" out and that also removes the code that references the undefined variables. But when BRANCH_PROFILING is enabled, it fools gcc into keeping the if statement around, and the code thus references the undefined symbols and fails to build. Reported-by: kbuild test robot <[email protected]> Signed-off-by: Namhyung Kim <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2014-01-02  tracing/uprobes: Add @+file_offset fetch method  (Namhyung Kim; 4 files, -1/+62)
Enable fetching data from a file offset. Currently it only supports fetching from the same binary the uprobe is set on. It'll translate the file offset to a proper virtual address in the process. The syntax is "@+OFFSET", similar to normal memory fetching (@ADDR), which does no address translation. Suggested-by: Oleg Nesterov <[email protected]> Acked-by: Masami Hiramatsu <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Cc: Srikar Dronamraju <[email protected]> Cc: zhangwei(Jovi) <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Signed-off-by: Namhyung Kim <[email protected]>
2014-01-02  uprobes: Allocate ->utask before handler_chain() for tracing handlers  (Oleg Nesterov; 1 file, -0/+4)
uprobe_trace_print() and uprobe_perf_print() need to pass the additional info to call_fetch() methods, currently there is no simple way to do this. current->utask looks like a natural place to hold this info, but we need to allocate it before handler_chain(). This is a bit unfortunate, perhaps we will find a better solution later, but this is simple and should work right now. Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Masami Hiramatsu <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Cc: Srikar Dronamraju <[email protected]> Signed-off-by: Namhyung Kim <[email protected]>
2014-01-02  tracing/uprobes: Add support for full argument access methods  (Namhyung Kim; 1 file, -12/+22)
Enable fetching other types of arguments for uprobes. IOW, we can now access stack, memory, deref, bitfield and retval from uprobes. The format for the argument types is the same as for kprobes (but the @SYMBOL type is not supported for uprobes), i.e.:

    @ADDR             : Fetch memory at ADDR
    $stackN           : Fetch Nth entry of stack (N >= 0)
    $stack            : Fetch stack address
    $retval           : Fetch return value
    +|-offs(FETCHARG) : Fetch memory at FETCHARG +|- offs address

Note that retval can only be used with uretprobes. Original-patch-by: Hyeoncheol Lee <[email protected]> Acked-by: Masami Hiramatsu <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Cc: Srikar Dronamraju <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: zhangwei(Jovi) <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Signed-off-by: Hyeoncheol Lee <[email protected]> Signed-off-by: Namhyung Kim <[email protected]>
2014-01-02  tracing/uprobes: Fetch args before reserving a ring buffer  (Namhyung Kim; 1 file, -14/+132)
Fetching from user space should be done in a non-atomic context. So use a per-cpu buffer and copy its content to the ring buffer atomically. Note that we can migrate while accessing user memory, thus use a per-cpu mutex to protect concurrent accesses. This is needed since we'll be able to fetch args from user memory, which can be swapped out. Before this, uprobes could only fetch args from registers, which are saved in kernel space. While at it, use __get_data_size() and store_trace_args() to reduce code duplication. And add struct uprobe_cpu_buffer and its helpers as suggested by Oleg. Reviewed-by: Masami Hiramatsu <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Cc: Srikar Dronamraju <[email protected]> Cc: zhangwei(Jovi) <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Signed-off-by: Namhyung Kim <[email protected]>
2014-01-02  tracing/uprobes: Pass 'is_return' to traceprobe_parse_probe_arg()  (Namhyung Kim; 1 file, -1/+1)
Currently uprobes don't pass is_return to the argument parser, so it cannot make use of the "$retval" fetch method, which only works for return probes. Reviewed-by: Masami Hiramatsu <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Cc: Srikar Dronamraju <[email protected]> Cc: zhangwei(Jovi) <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Signed-off-by: Namhyung Kim <[email protected]>
2014-01-02  tracing/probes: Implement 'memory' fetch method for uprobes  (Namhyung Kim; 4 files, -81/+129)
Use separate method to fetch from memory. Move existing functions to trace_kprobe.c and make them static. Also add new memory fetch implementation for uprobes. Acked-by: Masami Hiramatsu <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Cc: Srikar Dronamraju <[email protected]> Cc: zhangwei(Jovi) <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Signed-off-by: Namhyung Kim <[email protected]>
2014-01-02  tracing/probes: Add fetch{,_size} member into deref fetch method  (Hyeoncheol Lee; 1 file, -2/+20)
The deref fetch methods access a memory region, but they assume it is kernel memory since uprobes do not support them yet. Add ->fetch and ->fetch_size members in order to provide proper access methods for supporting uprobes. Acked-by: Masami Hiramatsu <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Cc: Srikar Dronamraju <[email protected]> Cc: zhangwei(Jovi) <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Signed-off-by: Hyeoncheol Lee <[email protected]> [[email protected]: Split original patch into pieces as requested] Signed-off-by: Namhyung Kim <[email protected]>
2014-01-02  tracing/probes: Move 'symbol' fetch method to kprobes  (Namhyung Kim; 4 files, -59/+91)
Move existing functions to trace_kprobe.c and add NULL entries to the uprobes fetch type table. I don't make them static since some generic routines like update/free_XXX_fetch_param() require pointers to the functions. Acked-by: Oleg Nesterov <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Srikar Dronamraju <[email protected]> Cc: zhangwei(Jovi) <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Signed-off-by: Namhyung Kim <[email protected]>
2014-01-02  tracing/probes: Implement 'stack' fetch method for uprobes  (Namhyung Kim; 4 files, -26/+66)
Use separate method to fetch from stack. Move existing functions to trace_kprobe.c and make them static. Also add new stack fetch implementation for uprobes. Acked-by: Oleg Nesterov <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Srikar Dronamraju <[email protected]> Cc: zhangwei(Jovi) <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Signed-off-by: Namhyung Kim <[email protected]>
2014-01-02  tracing/probes: Split [ku]probes_fetch_type_table  (Namhyung Kim; 4 files, -39/+119)
Use separate fetch_type_tables for kprobes and uprobes. They currently share all fetch methods but some of them will be implemented differently later. This is done so as not to break the build if only one of [ku]probes is configured (like !CONFIG_KPROBE_EVENT and CONFIG_UPROBE_EVENT). So I added '__weak' to the table declaration so that it can be safely omitted when it is configured out. Acked-by: Oleg Nesterov <[email protected]> Acked-by: Masami Hiramatsu <[email protected]> Cc: Srikar Dronamraju <[email protected]> Cc: zhangwei(Jovi) <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Signed-off-by: Namhyung Kim <[email protected]>
2014-01-02  tracing/probes: Move fetch function helpers to trace_probe.h  (Namhyung Kim; 2 files, -61/+78)
Move fetch function helper macros/functions to the header file and make them external. This is preparation for supporting the uprobe fetch table in the next patch. Acked-by: Masami Hiramatsu <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Cc: Srikar Dronamraju <[email protected]> Cc: zhangwei(Jovi) <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Signed-off-by: Namhyung Kim <[email protected]>
2014-01-02  tracing/probes: Integrate duplicate set_print_fmt()  (Namhyung Kim; 4 files, -116/+66)
The set_print_fmt() functions are implemented almost identically for [ku]probes. Move the code to a common place and get rid of the duplication. Acked-by: Masami Hiramatsu <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Cc: Srikar Dronamraju <[email protected]> Cc: zhangwei(Jovi) <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Signed-off-by: Namhyung Kim <[email protected]>
2014-01-02  tracing/kprobes: Move common functions to trace_probe.h  (Namhyung Kim; 2 files, -48/+48)
The __get_data_size() and store_trace_args() will be used by uprobes too. Move them to a common location. Acked-by: Masami Hiramatsu <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Cc: Srikar Dronamraju <[email protected]> Cc: zhangwei(Jovi) <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Signed-off-by: Namhyung Kim <[email protected]>
2014-01-02  tracing/uprobes: Convert to struct trace_probe  (Namhyung Kim; 1 file, -80/+79)
Convert struct trace_uprobe to make use of the common trace_probe structure. Reviewed-by: Masami Hiramatsu <[email protected]> Acked-by: Srikar Dronamraju <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Cc: zhangwei(Jovi) <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Signed-off-by: Namhyung Kim <[email protected]>