path: root/kernel
2018-12-11  sched/fair: Introduce an energy estimation helper function  (Quentin Perret; 1 file, +76/-0)
In preparation for the definition of an energy-aware wakeup path, introduce a helper function to estimate the consequence on system energy when a specific task wakes up on a specific CPU. compute_energy() estimates the capacity state to be reached by all performance domains and estimates the consumption of each online CPU according to its Energy Model and its percentage of busy time.

Signed-off-by: Quentin Perret <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
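A minimal sketch of that estimation, reconstructed from the description above: walk the performance domains, derive the highest and the summed utilization each would see, and let the Energy Model price the result. cpu_util_next() and em_pd_energy() are names from this patch series; the loop shape and everything else here is an assumption, not the verbatim kernel code.

  /* Sketch only: estimate system energy if @p wakes up on @dst_cpu. */
  static long estimate_energy(struct task_struct *p, int dst_cpu,
                              struct perf_domain *pd)
  {
          long energy = 0;

          for (; pd; pd = pd->next) {
                  unsigned long max_util = 0, sum_util = 0;
                  int cpu;

                  for_each_cpu(cpu, perf_domain_span(pd)) {
                          /* Utilization of @cpu with @p placed on @dst_cpu */
                          unsigned long util = cpu_util_next(cpu, p, dst_cpu);

                          sum_util += util;
                          max_util = max(max_util, util);
                  }

                  /*
                   * The EM maps max_util to a capacity state (the frequency
                   * schedutil would pick) and scales its cost by busy time.
                   */
                  energy += em_pd_energy(pd->em_pd, max_util, sum_util);
          }

          return energy;
  }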
2018-12-11  sched/fair: Add over-utilization/tipping point indicator  (Morten Rasmussen; 2 files, +61/-2)
Energy-aware scheduling is only meant to be active while the system is _not_ over-utilized. That is, there are spare cycles available to shift tasks around based on their actual utilization to get a more energy-efficient task distribution without depriving any tasks.

When above the tipping point, task placement is done the traditional way based on load_avg, spreading the tasks across as many cpus as possible based on priority-scaled load to preserve smp_nice. Below the tipping point we want to use util_avg instead. We therefore need a criterion for when to make the switch.

The util_avg for each cpu converges towards 100% regardless of how many additional tasks we may put on it. If we define over-utilized as:

  sum_{cpus}(rq.cfs.avg.util_avg) + margin > sum_{cpus}(rq.capacity)

some individual cpus may be over-utilized, running multiple tasks, even when the above condition is false. That should be okay as long as we try to spread the tasks out to avoid per-cpu over-utilization as much as possible and all tasks have the _same_ priority. If the latter isn't true, we have to consider priority to preserve smp_nice.

For example, we could have n_cpus nice=-10 util_avg=55% tasks and n_cpus/2 nice=0 util_avg=60% tasks. Balancing based on util_avg, we are likely to end up with the nice=-10 tasks sharing cpus and the nice=0 tasks getting their own, as we have 1.5*n_cpus tasks in total and 55%+55% is less over-utilized than 55%+60% for those cpus that have to be shared. The system utilization is only 85% of the system capacity, but we are breaking smp_nice.

To be sure not to break smp_nice, we have instead defined over-utilization conservatively as when any cpu in the system is fully utilized at its highest frequency:

  cpu_rq(any).cfs.avg.util_avg + margin > cpu_rq(any).capacity

IOW, as soon as one cpu is (nearly) 100% utilized, we switch to load_avg to factor in priority and preserve smp_nice.

With this definition, we can skip periodic load-balance as no cpu has an always-running task when the system is not over-utilized. All tasks will be periodic and we can balance them at wake-up. This conservative condition does however mean that some scenarios that could benefit from energy-aware decisions, even if one cpu is fully utilized, would not get those benefits.

For systems where some cpus might have reduced capacity (RT pressure and/or big.LITTLE), we want periodic load-balance checks as soon as just a single cpu is fully utilized, as it might be one of those with reduced capacity, in which case we want to migrate its load.

[ peterz: Added a comment explaining why new tasks are not accounted during overutilization detection. ]

Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
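The conservative per-CPU condition above boils down to a very small predicate. A sketch close to what this patch adds, assuming the fair class's existing capacity_of()/cpu_util() helpers and its ~20% capacity_margin (1280/1024):

  /* Over-utilized as soon as one CPU (nearly) runs out of capacity. */
  static inline int cpu_overutilized(int cpu)
  {
          /* capacity_margin is 1280, i.e. a ~20% headroom requirement. */
          return (capacity_of(cpu) * 1024) < (cpu_util(cpu) * capacity_margin);
  }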
2018-12-11  sched/fair: Clean-up update_sg_lb_stats parameters  (Quentin Perret; 2 files, +14/-16)
In preparation for the introduction of a new root domain flag which can be set during load balance (the 'overutilized' flag), clean up the set of parameters passed to update_sg_lb_stats(). More specifically, the 'local_group' and 'local_idx' parameters can be removed since they can easily be reconstructed from within the function.

While at it, transform the 'overload' parameter into a flag stored in the 'sg_status' parameter, hence facilitating the definition of new flags when needed.

Suggested-by: Peter Zijlstra <[email protected]>
Suggested-by: Valentin Schneider <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
2018-12-11  sched/topology: Introduce the 'sched_energy_present' static key  (Quentin Perret; 2 files, +28/-4)
In order to make sure Energy Aware Scheduling (EAS) will not impact systems where no Energy Model is available, introduce a static key guarding the access to EAS code. Since EAS is enabled on a per-root-domain basis, the static key is enabled when at least one root domain meets all conditions for EAS.

Signed-off-by: Quentin Perret <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
2018-12-11  sched/topology: Make Energy Aware Scheduling depend on schedutil  (Quentin Perret; 3 files, +60/-9)
Energy Aware Scheduling (EAS) is designed with the assumption that frequencies of CPUs follow their utilization value. When using a CPUFreq governor other than schedutil, the chances of this assumption being true are small, if any. When schedutil is in use, EAS' predictions are at least consistent with the frequency requests. Although those requests are not guaranteed to be honored by the hardware, they should at least guide DVFS in the right direction and provide some hope that the EAS model is accurate.

To make sure EAS is only used in a sane configuration, create a strong dependency on schedutil being used. Since having sugov compiled in does not provide that guarantee, make CPUFreq call a scheduler function on governor changes, hence letting it rebuild the scheduling domains, check the governors of the online CPUs, and enable/disable EAS accordingly.

Signed-off-by: Quentin Perret <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
2018-12-11  sched/topology: Disable EAS on inappropriate platforms  (Quentin Perret; 1 file, +48/-1)
Energy Aware Scheduling (EAS) in its current form is most relevant on platforms with asymmetric CPU topologies (e.g. Arm big.LITTLE) since this is where there is a lot of potential for saving energy through scheduling. This is particularly true since the Energy Model only includes the active power costs of CPUs, hence not providing enough data to compare packing-vs-spreading strategies.

As such, disable EAS on root domains where the SD_ASYM_CPUCAPACITY flag is not set. While at it, disable EAS on systems where the complexity of the Energy Model is too high, since that could lead to unacceptable scheduling overhead.

All in all, EAS can be used on a root domain if and only if:

  1. an Energy Model is available;
  2. the root domain has an asymmetric CPU capacity topology;
  3. the complexity of the root domain's EM is low enough to keep
     scheduling overheads low.

Signed-off-by: Quentin Perret <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
2018-12-11  sched/topology: Add lowest CPU asymmetry sched_domain level pointer  (Quentin Perret; 3 files, +9/-4)
Add another member to the family of per-cpu sched_domain shortcut pointers. This one, sd_asym_cpucapacity, points to the lowest level at which the SD_ASYM_CPUCAPACITY flag is set. While at it, rename the sd_asym shortcut to sd_asym_packing to avoid confusion.

Generally speaking, the largest opportunity to save energy via scheduling comes from a smarter exploitation of heterogeneous platforms (i.e. big.LITTLE). Consequently, the sd_asym_cpucapacity shortcut will be used at first as the lowest domain where Energy-Aware Scheduling (EAS) should be applied. For example, it is possible to apply EAS within a socket on a multi-socket system, as long as each socket has an asymmetric topology. Energy-aware cross-socket wake-up balancing will only happen when the system is over-utilized, or this_cpu and prev_cpu are in different sockets.

Suggested-by: Morten Rasmussen <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
2018-12-11  sched/topology: Reference the Energy Model of CPUs when available  (Quentin Perret; 2 files, +151/-4)
The existing scheduling domain hierarchy is defined to map to the cache topology of the system. However, Energy Aware Scheduling (EAS) requires more knowledge about the platform, and specifically needs to know about the span of Performance Domains (PD), which do not always align with caches.

To address this issue, use the Energy Model (EM) of the system to extend the scheduler topology code with a representation of the PDs, alongside the scheduling domains. More specifically, a linked list of PDs is attached to each root domain. When multiple root domains are in use, each list contains only the PDs covering the CPUs of its root domain. If a PD spans CPUs of multiple different root domains, it will be duplicated in all lists.

The lists are fully maintained by the scheduler from partition_sched_domains() in order to cope with hotplug and cpuset changes. As for scheduling domains, the lists are protected by RCU to ensure safe concurrent updates.

Signed-off-by: Quentin Perret <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
2018-12-11  PM: Introduce an Energy Model management framework  (Quentin Perret; 3 files, +218/-0)
Several subsystems in the kernel (task scheduler and/or thermal at the time of writing) can benefit from knowing about the energy consumed by CPUs. Yet, this information can come from different sources (DT or firmware for example), in different formats, hence making it hard to exploit without a standard API.

As an attempt to address this, introduce a centralized Energy Model (EM) management framework which aggregates the power values provided by drivers into a table for each performance domain in the system. The power cost tables are made available to interested clients (e.g. task scheduler or thermal) via platform-agnostic APIs.

The overall design is represented by the diagram below (focused on Arm-related drivers as an example, but applicable to any architecture):

     +---------------+  +-----------------+  +-------------+
     | Thermal (IPA) |  | Scheduler (EAS) |  |    Other    |
     +---------------+  +-----------------+  +-------------+
             |                   | em_pd_energy()   |
             | em_cpu_get()      |                  |
             +-----------+       |        +---------+
                         |       |        |
                         v       v        v
                      +---------------------+
                      |                     |
                      |    Energy Model     |
                      |                     |
                      |      Framework      |
                      |                     |
                      +---------------------+
                         ^       ^       ^
                         |       |       | em_register_perf_domain()
              +----------+       |       +---------+
              |                  |                 |
     +---------------+  +---------------+  +--------------+
     |  cpufreq-dt   |  |   arm_scmi    |  |    Other     |
     +---------------+  +---------------+  +--------------+
              ^                  ^                 ^
              |                  |                 |
     +--------------+   +---------------+  +--------------+
     | Device Tree  |   |   Firmware    |  |      ?       |
     +--------------+   +---------------+  +--------------+

Drivers (typically, but not limited to, CPUFreq drivers) can register data in the EM framework using the em_register_perf_domain() API. The calling driver must provide a callback function with a standardized signature that will be used by the EM framework to build the power cost tables of the performance domain. This design should offer a lot of flexibility to calling drivers, which are free to read information from any location and to use any technique to compute power costs. Moreover, the capacity states registered by drivers in the EM framework are not required to match real performance states of the target. This is particularly important on targets where the performance states are not known by the OS.

The power cost coefficients managed by the EM framework are specified in milli-watts. Although the two potential users of those coefficients (IPA and EAS) only need relative correctness, IPA specifically needs to compare the power of CPUs with the power of other components (GPUs, for example), which are still expressed in absolute terms in their respective subsystems. Hence, specifying the power of CPUs in milli-watts should help transition IPA to using the EM framework without introducing new problems, by keeping units comparable across sub-systems. In the longer term, the EM of devices other than CPUs could also be managed by the EM framework, which would make it possible to remove the absolute unit. However, this is not absolutely required as a first step, so this extension of the EM framework is left for later.

On the client side, the EM framework offers APIs to access the power cost tables of a CPU (em_cpu_get()), and to estimate the energy consumed by the CPUs of a performance domain (em_pd_energy()). Clients such as the task scheduler can then use these APIs to access the shared data structures holding the Energy Model of CPUs.

Signed-off-by: Quentin Perret <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
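To make the registration flow concrete, here is a hypothetical driver-side sketch against the API named above (em_register_perf_domain() and its callback). The function names, OPP table and power numbers are invented for illustration; a real driver would pull these from DT or firmware:

  #include <linux/energy_model.h>

  /*
   * Called by the EM framework once per capacity state: return the
   * lowest supported frequency above *KHz and its active power cost.
   */
  static int example_active_power(unsigned long *mW, unsigned long *KHz,
                                  int cpu)
  {
          /* Illustrative table; real numbers would come from DT/firmware. */
          static const unsigned long freq[]  = { 500000, 1000000, 1500000 };
          static const unsigned long power[] = {    120,     350,     800 };
          int i;

          for (i = 0; i < ARRAY_SIZE(freq); i++) {
                  if (freq[i] > *KHz) {
                          *KHz = freq[i];
                          *mW  = power[i];
                          return 0;
                  }
          }
          return -EINVAL;
  }

  static struct em_data_callback em_cb = EM_DATA_CB(example_active_power);

  static int example_register(cpumask_t *cpus)
  {
          /* Build a 3-state Energy Model for this performance domain. */
          return em_register_perf_domain(cpus, 3, &em_cb);
  }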
2018-12-11  sched/cpufreq: Prepare schedutil for Energy Aware Scheduling  (Quentin Perret; 2 files, +68/-15)
Schedutil requests frequency by aggregating utilization signals from the scheduler (CFS, RT, DL, IRQ) and applying a 25% margin on top of them. Since Energy Aware Scheduling (EAS) needs to be able to predict the frequency requests, it needs to forecast the decisions made by the governor.

In order to prepare the introduction of EAS, introduce schedutil_freq_util() to centralize the aforementioned signal aggregation and make it available to both schedutil and EAS. Since frequency selection and energy estimation still need to deal with RT and DL signals slightly differently, schedutil_freq_util() is called with a different 'type' parameter in those two contexts, and returns an aggregated utilization signal accordingly. While at it, introduce the map_util_freq() function which is designed to make schedutil's 25% margin usable easily for both sugov and EAS.

As EAS will be able to predict schedutil's frequency requests more accurately than any other governor by design, it'd be sensible to make sure EAS cannot be used without schedutil. This will be done later, once EAS has actually been introduced.

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
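The margin helper is essentially a one-liner of the following shape (freq + freq/4 gives the 1.25 factor in integer arithmetic); check include/linux/sched/cpufreq.h for the authoritative definition:

  static inline unsigned long map_util_freq(unsigned long util,
                                            unsigned long freq,
                                            unsigned long cap)
  {
          /* freq * 1.25 * util / cap, i.e. the 25% margin */
          return (freq + (freq >> 2)) * util / cap;
  }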
2018-12-11  sched/topology: Relocate arch_scale_cpu_capacity() to the internal header  (Quentin Perret; 1 file, +0/-18)
By default, arch_scale_cpu_capacity() is only visible from within the kernel/sched folder. Relocate it to include/linux/sched/topology.h to make it visible to other clients needing to know about the capacity of CPUs, such as the Energy Model framework. This also shrinks the kernel/sched/sched.h internal header accordingly.

Signed-off-by: Quentin Perret <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
2018-12-11  sched/core: Remove unnecessary unlikely() in push_*_task()  (Yangtao Li; 2 files, +2/-6)
WARN_ON() already contains an unlikely(), so there is no need to test the condition separately and then call WARN_ON(1); WARN_ON(condition) can be used directly.

Signed-off-by: Yangtao Li <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
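The shape of the change, with an illustrative condition in the style of push_rt_task()/push_dl_task():

  /* Before: explicit branch plus an unconditional warning. */
  if (unlikely(next_task == rq->curr)) {
          WARN_ON(1);
          return 0;
  }

  /* After: WARN_ON() already wraps its argument in unlikely(). */
  if (WARN_ON(next_task == rq->curr))
          return 0;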
2018-12-11  sched/topology: Remove the ::smt_gain field from 'struct sched_domain'  (Vincent Guittot; 2 files, +0/-5)
::smt_gain is used to compute the capacity of CPUs of an SMT core, with the constraint 1 < ::smt_gain < 2, in order to be able to compute the number of CPUs per core. The field has_free_capacity of struct numa_stat, which was the last user of this computation of the number of CPUs per core, has been removed by:

  2d4056fafa19 ("sched/numa: Remove numa_has_capacity()")

We can now remove this constraint on core capacity and use the default value SCHED_CAPACITY_SCALE for SMT CPUs. With this removal, SCHED_CAPACITY_SCALE becomes the maximum compute capacity of CPUs on every system. This should help to simplify some code and remove fields like rd->max_cpu_capacity.

Furthermore, arch_scale_cpu_capacity() is used with a NULL sd in several other places in the code when it wants the capacity of a CPU to scale some metrics, like in pelt, deadline or schedutil. In the SMT case, the value returned there is not the capacity of SMT CPUs but the default SCHED_CAPACITY_SCALE. So remove ::smt_gain.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
2018-12-11  sched/fair: Clean up comment in nohz_idle_balance()  (Andrea Parri; 1 file, +1/-3)
Concerning the comment associated with the atomic_fetch_andnot() in nohz_idle_balance(), Vincent explains [1]:

  "[...] the comment is useless and can be removed [...] it was
   referring to a line code above the comment that was present in a
   previous iteration of the patchset. This line disappeared in final
   version but the comment has stayed."

So remove the comment.

Vincent also points out that the full ordering associated with the atomic_fetch_andnot() primitive could be relaxed, but this patch insists on the current, more conservative/fully ordered solution: "Performance" isn't a concern, stay away from "correctness"/subtle relaxed (re)ordering if possible..., just make sure not to confuse the next reader with misleading/out-of-date comments.

[1] http://lkml.kernel.org/r/CAKfTPtBjA-oCBRkO6__npQwL3+HLjzk7riCcPU1R7YdO-EpuZg@mail.gmail.com

Suggested-by: Vincent Guittot <[email protected]>
Signed-off-by: Andrea Parri <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
2018-12-11  locking/lockdep: Stop using RCU primitives to access 'all_lock_classes'  (Bart Van Assche; 1 file, +5/-4)
Due to the previous patch, all code that accesses the 'all_lock_classes' list holds the graph lock. Hence use regular list primitives instead of their RCU variants to access this list.

Signed-off-by: Bart Van Assche <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Waiman Long <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
2018-12-11  locking/lockdep: Make concurrent lockdep_reset_lock() calls safe  (Bart Van Assche; 1 file, +4/-1)
Since zap_class() removes items from the all_lock_classes list and the classhash_table, protect all zap_class() calls against concurrent data structure modifications with the graph lock.

Signed-off-by: Bart Van Assche <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Waiman Long <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
2018-12-11  locking/lockdep: Remove a superfluous INIT_LIST_HEAD() statement  (Bart Van Assche; 1 file, +0/-1)
Initializing a list entry just before it is passed to list_add_tail_rcu() is not necessary, because list_add_tail_rcu() overwrites the next and prev pointers anyway. Hence remove the INIT_LIST_HEAD() statement.

Signed-off-by: Bart Van Assche <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Waiman Long <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
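In essence, the removed pattern looks like this (the list member name is a plausible reconstruction, not verified against the diff):

  /* Removed by this patch: dead initialization... */
  INIT_LIST_HEAD(&class->lock_entry);

  /* ...because list_add_tail_rcu() rewrites both pointers anyway. */
  list_add_tail_rcu(&class->lock_entry, &all_lock_classes);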
2018-12-11  locking/lockdep: Introduce lock_class_cache_is_registered()  (Bart Van Assche; 1 file, +30/-20)
This patch does not change any functionality but makes the lockdep_reset_lock() function easier to read.

Signed-off-by: Bart Van Assche <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Waiman Long <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
2018-12-11  locking/lockdep: Inline __lockdep_init_map()  (Bart Van Assche; 1 file, +1/-7)
Since the function __lockdep_init_map() only has one caller, inline it into its caller. This patch does not change any functionality.

Signed-off-by: Bart Van Assche <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Waiman Long <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
2018-12-11  locking/lockdep: Declare local symbols static  (Bart Van Assche; 1 file, +3/-0)
This patch keeps sparse from complaining about a missing declaration for the lock_classes array when building with CONFIG_DEBUG_LOCKDEP=n.

Signed-off-by: Bart Van Assche <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Waiman Long <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
2018-12-11  dma-debug: Batch dma_debug_entry allocation  (Robin Murphy; 1 file, +21/-29)
DMA debug entries are one of those things which aren't that useful individually - we will always want some larger quantity of them - and which we don't really need to manage the exact number of - we only care about having 'enough'. In that regard, the current behaviour of creating them one-by-one leads to a lot of unwarranted function call overhead and memory wasted on alignment padding.

Now that we don't have to worry about freeing anything via dma_debug_resize_entries(), we can optimise the allocation behaviour by grabbing whole pages at once, which will save considerably on the aforementioned overheads, and probably offer a little more cache/TLB locality benefit for traversing the lists under normal operation. This should also give even less reason for an architecture-level override of the preallocation size, so make the definition unconditional - if there is still any desire to change the compile-time value for some platforms it would be better off as a Kconfig option anyway.

Since freeing a whole page of entries at once becomes enough of a challenge that it's not really worth complicating dma_debug_init(), we may as well tweak the preallocation behaviour such that as long as we manage to allocate *some* pages, we can leave debugging enabled on a best-effort basis rather than otherwise wasting them.

Signed-off-by: Robin Murphy <[email protected]>
Tested-by: Qian Cai <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
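A hedged sketch of the page-at-a-time allocation described here; the helper name and bookkeeping details are assumptions rather than the exact kernel code:

  static int dma_debug_add_page_of_entries(gfp_t gfp)
  {
          struct dma_debug_entry *entry;
          int i, per_page;

          entry = (void *)get_zeroed_page(gfp);
          if (!entry)
                  return -ENOMEM;

          /* Carve the page into entries and free-list them in bulk. */
          per_page = PAGE_SIZE / sizeof(*entry);
          for (i = 0; i < per_page; i++)
                  list_add_tail(&entry[i].list, &free_entries);

          num_free_entries += per_page;
          nr_total_entries += per_page;

          return 0;
  }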
2018-12-11  dma/debug: Remove dma_debug_resize_entries()  (Robin Murphy; 1 file, +0/-46)
With the only caller now gone, we can clean up this part of dma-debug's exposed internals and make way to tweak the allocation behaviour.

Signed-off-by: Robin Murphy <[email protected]>
Tested-by: Qian Cai <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
2018-12-11  dma-debug: Make leak-like behaviour apparent  (Robin Murphy; 1 file, +13/-0)
Now that we can dynamically allocate DMA debug entries to cope with drivers maintaining excessively large numbers of live mappings, a driver which *does* actually have a bug leaking mappings (and is not unloaded) will no longer trigger the "DMA-API: debugging out of memory - disabling" message until it gets to actual kernel OOM conditions, which means it could go unnoticed for a while.

To that end, let's inform the user each time the pool has grown to a multiple of its initial size, which should make it apparent that they either have a leak or might want to increase the preallocation size.

Signed-off-by: Robin Murphy <[email protected]>
Tested-by: Qian Cai <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
2018-12-11  dma-debug: Dynamically expand the dma_debug_entry pool  (Robin Murphy; 1 file, +41/-38)
Certain drivers such as large multi-queue network adapters can use pools of mapped DMA buffers larger than the default dma_debug_entry pool of 65536 entries, with the result that merely probing such a device can cause DMA debug to disable itself during boot unless explicitly given an appropriate "dma_debug_entries=..." option.

Developers trying to debug some other driver on such a system may not be immediately aware of this, and at worst it can hide bugs if they fail to realise that dma-debug has already disabled itself unexpectedly by the time their code of interest gets to run. Even once they do realise, it can be a bit of a pain to empirically determine a suitable number of preallocated entries to configure, short of massively over-allocating.

There's really no need for such a static limit, though, since we can quite easily expand the pool at runtime in those rare cases that the preallocated entries are insufficient, which is arguably the least surprising and most useful behaviour. To that end, refactor the prealloc_memory() logic a little bit to generalise it for runtime reallocations as well.

Signed-off-by: Robin Murphy <[email protected]>
Tested-by: Qian Cai <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
2018-12-11  dma-debug: Use pr_fmt()  (Robin Murphy; 1 file, +38/-36)
Use pr_fmt() to generate the "DMA-API: " prefix consistently. This results in it being added to a couple of pr_*() messages which were missing it before, and for the err_printk() calls moves it to the actual start of the message instead of somewhere in the middle.

Signed-off-by: Robin Murphy <[email protected]>
Tested-by: Qian Cai <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
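The mechanism, for reference: defining pr_fmt() before the first include makes every pr_*() call in the file pick up the prefix automatically (the sample message below is illustrative):

  #define pr_fmt(fmt)     "DMA-API: " fmt

  #include <linux/printk.h>

  static void example(void)
  {
          pr_err("debugging out of memory - disabling\n");
          /* prints: "DMA-API: debugging out of memory - disabling" */
  }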
2018-12-11  dma-debug: Expose nr_total_entries in debugfs  (Robin Murphy; 1 file, +7/-0)
Expose nr_total_entries in debugfs, so that {num,min}_free_entries become even more meaningful to users interested in current/maximum utilisation. This becomes even more relevant once nr_total_entries may change at runtime beyond just the existing AMD GART debug code.

Suggested-by: John Garry <[email protected]>
Signed-off-by: Robin Murphy <[email protected]>
Tested-by: Qian Cai <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
2018-12-11  sched/cpufreq: Add the SPDX tags  (Daniel Lezcano; 2 files, +2/-8)
The SPDX tags are not present in cpufreq.c and cpufreq_schedutil.c. Add them and remove the license descriptions.

Signed-off-by: Daniel Lezcano <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
2018-12-10  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  (David S. Miller; 4 files, +458/-99)
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-12-11

The following pull-request contains BPF updates for your *net-next* tree.

It has three minor merge conflicts, resolutions:

1) tools/testing/selftests/bpf/test_verifier.c

   Take first chunk with alignment_prevented_execution.

2) net/core/filter.c

   [...]
   case bpf_ctx_range_ptr(struct __sk_buff, flow_keys):
   case bpf_ctx_range(struct __sk_buff, wire_len):
        return false;
   [...]

3) include/uapi/linux/bpf.h

   Take the second chunk for the two cases each.

The main changes are:

1) Add support for BPF line info via BTF and extend libbpf as well as bpftool's program dump to annotate output with BPF C code to facilitate debugging and introspection, from Martin.

2) Add support for BPF_ALU | BPF_ARSH | BPF_{K,X} in interpreter and all JIT backends, from Jiong.

3) Improve BPF test coverage on archs with no efficient unaligned access by adding an "any alignment" flag to the BPF program load to forcefully disable verifier alignment checks, from David.

4) Add a new bpf_prog_test_run_xattr() API to libbpf which allows for proper use of BPF_PROG_TEST_RUN with data_out, from Lorenz.

5) Extend tc BPF programs to use a new __sk_buff field called wire_len for more accurate accounting of packets going to wire, from Petar.

6) Improve bpftool to allow dumping the trace pipe from it and add several improvements in bash completion and map/prog dump, from Quentin.

7) Optimize arm64 BPF JIT to always emit movn/movk/movk sequence for kernel addresses and add a dedicated BPF JIT backend allocator, from Ard.

8) Add a BPF helper function for IR remotes to report mouse movements, from Sean.

9) Various cleanups in BPF prog dump e.g. to make UAPI bpf_prog_info member naming consistent with existing conventions, from Yonghong and Song.

10) Misc cleanups and improvements in allowing to pass interface name via cmdline for xdp1 BPF example, from Matteo.

11) Fix a potential segfault in BPF sample loader's kprobes handling, from Daniel T.

12) Fix SPDX license in libbpf's README.rst, from Andrey.
====================

Signed-off-by: David S. Miller <[email protected]>
2018-12-10  bpf: rename *_info_cnt to nr_*_info in bpf_prog_info  (Yonghong Song; 1 file, +19/-19)
In uapi bpf.h, currently we have the following fields in struct bpf_prog_info:

  __u32 func_info_cnt;
  __u32 line_info_cnt;
  __u32 jited_line_info_cnt;

The field names "func_info_cnt" and "line_info_cnt" also appear in union bpf_attr for program loading. The original intention was to keep the names the same between bpf_prog_info and bpf_attr, implying that what we return to user space will be the same as what user space passed to the kernel. Such a naming convention in bpf_prog_info is, however, not consistent with other fields like:

  __u32 nr_jited_ksyms;
  __u32 nr_jited_func_lens;

This patch makes that adjustment: in bpf_prog_info, the newly introduced *_info_cnt fields become nr_*_info.

Acked-by: Song Liu <[email protected]>
Acked-by: Martin KaFai Lau <[email protected]>
Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
2018-12-10  bpf: clean up bpf_prog_get_info_by_fd()  (Song Liu; 1 file, +2/-2)
info.nr_jited_ksyms and info.nr_jited_func_lens cannot be 0 in these two statements, so we don't need to check them.

Signed-off-by: Song Liu <[email protected]>
Acked-by: Martin KaFai Lau <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
2018-12-10  bpf: relax verifier restriction on BPF_MOV | BPF_ALU  (Jiong Wang; 1 file, +12/-4)
Currently, the destination register is marked as unknown for a 32-bit sub-register move (BPF_MOV | BPF_ALU) whenever the source register type is SCALAR_VALUE. This is so conservative that some valid cases will be rejected. In particular, it may turn a constant scalar value into an unknown value, which can break assumptions the verifier relies on. For example, test_l4lb_noinline.c has the following C code:

    struct real_definition *dst

  1:  if (!get_packet_dst(&dst, &pckt, vip_info, is_ipv6))
  2:    return TC_ACT_SHOT;
  3:
  4:  if (dst->flags & F_IPV6) {

get_packet_dst() is responsible for initializing "dst" into a valid pointer and returning true (1), otherwise it returns false (0). The compiled instruction sequence using alu32 will be:

  412: (54) (u32) r7 &= (u32) 1
  413: (bc) (u32) r0 = (u32) r7
  414: (95) exit

Insn 413, a BPF_MOV | BPF_ALU, will however turn r0 into an unknown value even though r7 contains SCALAR_VALUE 1. This causes trouble when the verifier walks the code path that hasn't initialized "dst" inside get_packet_dst(), for which case 0 is returned: we would expect the verifier to conclude that line 1 in the above C code passes the "if" check and therefore to skip the fall-through path starting at line 4. Now, because r0 returned from the callee has become an unknown value, the verifier won't skip analyzing the path starting at line 4, and "dst->flags" requires dereferencing the pointer "dst", which actually hasn't been initialized for this path.

This patch relaxes the marking of the sub-register move destination. For a SCALAR_VALUE, it is safe to just copy the value from the source and then truncate it to 32 bits. A unit test is also included to demonstrate this issue; it fails before this patch.

This relaxation could let the verifier skip more paths for conditional comparisons against immediates. It also lets the verifier record a more accurate/strict value for one register at one state; if this state ends up going through exit without rejection and is used for state comparison later, then it is possible that an inaccurate/permissive value would have been better. So the real impact on the number of processed insns is complex. But in all, without this fix, valid programs could be rejected.

From real benchmarking on kernel selftests and Cilium bpf tests, there is no impact on the processed instruction count when tests are compiled with default compilation options. There are slight improvements when they are compiled with -mattr=+alu32 after this patch. Also, test_xdp_noinline/-mattr=+alu32 now passes verification; it was rejected before this fix.

Insns processed before/after this patch:

                          default      -mattr=+alu32
  Kernel selftest
  ===
  test_xdp.o              371/371      369/369
  test_l4lb.o             6345/6345    5623/5623
  test_xdp_noinline.o     2971/2971    rejected/2727
  test_tcp_estates.o      429/429      430/430

  Cilium bpf
  ===
  bpf_lb-DLB_L3.o:        2085/2085    1685/1687
  bpf_lb-DLB_L4.o:        2287/2287    1986/1982
  bpf_lb-DUNKNOWN.o:      690/690      622/622
  bpf_lxc.o:              95033/95033  N/A
  bpf_netdev.o:           7245/7245    N/A
  bpf_overlay.o:          2898/2898    3085/2947

  NOTE:
  - bpf_lxc.o and bpf_netdev.o compiled with -mattr=+alu32 are rejected
    by the verifier due to another issue inside the verifier's support
    for alu32 binaries.
  - Each Cilium bpf program generates several processed insn counts;
    the numbers above are their sums.

v1->v2:
  - Restrict the change to SCALAR_VALUE.
  - Update benchmark numbers on the Cilium bpf tests.

Signed-off-by: Jiong Wang <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
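A sketch of the relaxed marking in check_alu_op(), following the description above (the surrounding pointer-value checks are omitted and the in-tree code may differ in detail):

  if (src_reg->type == SCALAR_VALUE) {
          /* Keep the known scalar value instead of discarding it... */
          *dst_reg = *src_reg;
  } else {
          mark_reg_unknown(env, regs, insn->dst_reg);
  }
  /* ...then truncate the tracked value to the low 32 bits. */
  coerce_reg_to_size(dst_reg, 4);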
2018-12-10  fgraph: Add comment to describe ftrace_graph_get_ret_stack  (Steven Rostedt (VMware); 1 file, +11/-0)
The ret_stack should not be accessed directly via the curr_ret_stack variable on the task_struct, because the ret_stack is going to be converted into a series of longs and not an array of ret_stack structures. The way a ret_stack should be retrieved is via the ftrace_graph_get_ret_stack() function, and how to use it needs to be documented.

Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-12-10  ftrace: Allow ftrace_replace_code() to be schedulable  (Steven Rostedt (VMware); 1 file, +16/-3)
The function ftrace_replace_code() is the ftrace engine that does the work of modifying all the nops into the calls to the function callback in all the functions being traced. The generic version is normally called from stop machine, but an architecture can implement a non-stop-machine version and still use the generic ftrace_replace_code(). When an architecture does this, ftrace_replace_code() may be called from a schedulable context, where it can allow the code to be preemptible and schedule out.

In order to allow an architecture to make ftrace_replace_code() schedulable, a new command flag is added, called FTRACE_MAY_SLEEP. It can be OR'd into the command that is passed to ftrace_modify_all_code(), which calls ftrace_replace_code(), and will have it call cond_resched() in the loop that modifies the nops into the calls to the ftrace trampolines.

Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Reported-by: Anders Roxell <[email protected]>
Tested-by: Anders Roxell <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
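A condensed view of the generic loop after this change (error handling and per-record checks omitted; treat this as a sketch of the shape, not the full kernel/trace/ftrace.c version):

  void __weak ftrace_replace_code(int mod_flags)
  {
          int enable = mod_flags & FTRACE_MODIFY_ENABLE_FL;
          int schedulable = mod_flags & FTRACE_MODIFY_MAY_SLEEP_FL;
          struct ftrace_page *pg;
          struct dyn_ftrace *rec;

          do_for_each_ftrace_rec(pg, rec) {
                  __ftrace_replace_code(rec, enable);

                  /* New: give the scheduler a chance between records. */
                  if (schedulable)
                          cond_resched();
          } while_for_each_ftrace_rec();
  }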
2018-12-10  tracing: Add generic event-name based remove event method  (Masami Hiramatsu; 1 file, +11/-4)
Add a generic method to remove an event from the dynamic event list. This is the same as for other systems under ftrace: you just pass the event name with a '!' prefix, e.g.

  # echo p:new_grp/new_event _do_fork > dynamic_events

This creates an event, and

  # echo '!p:new_grp/new_event _do_fork' > dynamic_events

or

  # echo '!p:new_grp/new_event' > dynamic_events

will remove the new_grp/new_event event. Note that this doesn't check the event prefix (e.g. "p:") strictly, because the "group/event" name must be unique.

Link: http://lkml.kernel.org/r/154140869774.17322.8887303560398645347.stgit@devbox
Reviewed-by: Tom Zanussi <[email protected]>
Tested-by: Tom Zanussi <[email protected]>
Signed-off-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-12-10  tracing: Consolidate trace_add/remove_event_call back to the nolock functions  (Steven Rostedt (VMware); 4 files, +11/-33)
The trace_add/remove_event_call_nolock() functions were added to allow the trace_add/remove_event_call() code to be called when the event_mutex lock was already taken. Now that all callers are done within the event_mutex, there's no reason to have two different interfaces. Remove the current wrapper trace_add/remove_event_call()s and rename the _nolock versions back to the original names.

Link: http://lkml.kernel.org/r/154140866955.17322.2081425494660638846.stgit@devbox
Acked-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-12-10  printk: fix printk_time race.  (Tetsuo Handa; 1 file, +39/-31)
Since printk_time can be toggled via /sys/module/printk/parameters/time, it is not safe to assume that the output length does not change across multiple msg_print_text() calls. If we hit this race, we can observe failures such as SYSLOG_ACTION_READ_ALL writing more bytes than userspace has supplied, SYSLOG_ACTION_SIZE_UNREAD returning -EFAULT when it should have succeeded, or SYSLOG_ACTION_READ reading garbage memory or even triggering a kernel oops at _copy_to_user() due to integer overflow.

To close this race, get a snapshot value of printk_time and pass it to SYSLOG_ACTION_READ, SYSLOG_ACTION_READ_ALL, SYSLOG_ACTION_SIZE_UNREAD and kmsg_dump_get_buffer().

Link: http://lkml.kernel.org/r/[email protected]
To: Sergey Senozhatsky <[email protected]>
Signed-off-by: Tetsuo Handa <[email protected]>
Reviewed-by: Sergey Senozhatsky <[email protected]>
Signed-off-by: Petr Mladek <[email protected]>
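The gist of the fix as a sketch (the helper name is invented; the real patch threads a 'time' parameter through msg_print_text() and its callers): sample printk_time once per syscall and reuse the snapshot for both the sizing pass and the copy pass, so the two can never disagree.

  static size_t count_unread_len(u64 seq, u32 idx, bool time)
  {
          size_t len = 0;

          while (seq < log_next_seq) {
                  struct printk_log *msg = log_from_idx(idx);

                  /* Same 'time' snapshot as the later copy pass uses. */
                  len += msg_print_text(msg, true, time, NULL, 0);
                  idx = log_next(idx);
                  seq++;
          }
          return len;
  }

  /* Caller side: bool time = printk_time;  taken once, passed around. */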
2018-12-09  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (David S. Miller; 16 files, +315/-76)
Several conflicts, seemingly all over the place. I used Stephen Rothwell's sample resolutions for many of these, if not just to double check my own work, so definitely the credit largely goes to him.

The NFP conflict consisted of a bug fix (moving operations past the rhashtable operation) while changing the initial argument in the function call in the moved code.

The net/dsa/master.c conflict had to do with a bug fix intermixing making dsa_master_set_mtu() static with fixing the tagging attribute location.

cls_flower had a conflict because the dup reject fix from Or overlapped with the addition of port range classification.

__set_phy_supported()'s conflict was relatively easy to resolve because Andrew fixed it in both trees, so it was just a matter of taking the net-next copy. Or at least I think it was :-)

Joe Stringer's fix to the handling of netns id 0 in bpf_sk_lookup() intermixed with changes to how the sdif and caller_net are calculated in these code paths in net-next.

The remaining BPF conflicts were largely about the addition of the __bpf_md_ptr stuff in 'net' overlapping with adjustments and additions to the relevant data structures where the MD pointer macros are used.

Signed-off-by: David S. Miller <[email protected]>
2018-12-09  Merge tag 'v4.20-rc6' into for-4.21/block  (Jens Axboe; 4 files, +174/-15)
Pull in v4.20-rc6 to resolve the conflict in NVMe, but also to get the two corruption fixes. We're going to be overhauling the direct dispatch path, and we need to do that on top of the changes we made for that in mainline.

Signed-off-by: Jens Axboe <[email protected]>
2018-12-09  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (Linus Torvalds; 2 files, +171/-14)
Pull networking fixes from David Miller:
 "A decent batch of fixes here. I'd say about half are for problems that
  have existed for a while, and half are for new regressions added in
  the 4.20 merge window.

   1) Fix 10G SFP phy module detection in mvpp2, from Baruch Siach.

   2) Revert bogus emac driver change, from Benjamin Herrenschmidt.

   3) Handle BPF exported data structure with pointers when building
      32-bit userland, from Daniel Borkmann.

   4) Memory leak fix in act_police, from Davide Caratti.

   5) Check RX checksum offload in RX descriptors properly in aquantia
      driver, from Dmitry Bogdanov.

   6) SKB unlink fix in various spots, from Edward Cree.

   7) ndo_dflt_fdb_dump() only works with ethernet, enforce this, from
      Eric Dumazet.

   8) Fix FID leak in mlxsw driver, from Ido Schimmel.

   9) IOTLB locking fix in vhost, from Jean-Philippe Brucker.

  10) Fix SKB truesize accounting in ipv4/ipv6/netfilter frag memory
      limits otherwise namespace exit can hang. From Jiri Wiesner.

  11) Address block parsing length fixes in x25 from Martin Schiller.

  12) IRQ and ring accounting fixes in bnxt_en, from Michael Chan.

  13) For tun interfaces, only iface delete works with rtnl ops, enforce
      this by disallowing add. From Nicolas Dichtel.

  14) Use after free in liquidio, from Pan Bian.

  15) Fix SKB use after passing to netif_receive_skb(), from Prashant
      Bhole.

  16) Static key accounting and other fixes in XPS from Sabrina Dubroca.

  17) Partially initialized flow key passed to ip6_route_output(), from
      Shmulik Ladkani.

  18) Fix RTNL deadlock during reset in ibmvnic driver, from Thomas
      Falcon.

  19) Several small TCP fixes (off-by-one on window probe abort, NULL
      deref in tail loss probe, SNMP mis-estimations) from Yuchung
      Cheng"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (93 commits)
  net/sched: cls_flower: Reject duplicated rules also under skip_sw
  bnxt_en: Fix _bnxt_get_max_rings() for 57500 chips.
  bnxt_en: Fix NQ/CP rings accounting on the new 57500 chips.
  bnxt_en: Keep track of reserved IRQs.
  bnxt_en: Fix CNP CoS queue regression.
  net/mlx4_core: Correctly set PFC param if global pause is turned off.
  Revert "net/ibm/emac: wrong bit is used for STA control"
  neighbour: Avoid writing before skb->head in neigh_hh_output()
  ipv6: Check available headroom in ip6_xmit() even without options
  tcp: lack of available data can also cause TSO defer
  ipv6: sr: properly initialize flowi6 prior passing to ip6_route_output
  mlxsw: spectrum_switchdev: Fix VLAN device deletion via ioctl
  mlxsw: spectrum_router: Relax GRE decap matching check
  mlxsw: spectrum_switchdev: Avoid leaking FID's reference count
  mlxsw: spectrum_nve: Remove easily triggerable warnings
  ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes
  sctp: frag_point sanity check
  tcp: fix NULL ref in tail loss probe
  tcp: Do not underestimate rwnd_limited
  net: use skb_list_del_init() to remove from RX sublists
  ...
2018-12-09  bpf: Add bpf_line_info support  (Martin KaFai Lau; 4 files, +368/-33)
This patch adds bpf_line_info support.

It accepts an array of bpf_line_info objects during BPF_PROG_LOAD. The "line_info", "line_info_cnt" and "line_info_rec_size" are added to "union bpf_attr". The "line_info_rec_size" makes bpf_line_info extensible in the future.

The new "check_btf_line()" ensures the userspace line_info is valid for the kernel to use.

When the verifier is translating/patching the bpf_prog (through "bpf_patch_insn_single()"), the line_infos' insn_off is also adjusted by the newly added "bpf_adj_linfo()".

If the bpf_prog is jited, this patch also provides the jited addrs (in aux->jited_linfo) for the corresponding line_info.insn_off. "bpf_prog_fill_jited_linfo()" is added to fill the aux->jited_linfo. It is currently called by the x86 jit. Other jits can also use "bpf_prog_fill_jited_linfo()"; that will be done in followup patches. In the future, if deemed necessary, a particular jit could also provide its own "bpf_prog_fill_jited_linfo()" implementation.

A few "*line_info*" fields are added to bpf_prog_info such that the user can get the xlated line_info back (i.e. the line_info with its insn_off reflecting the translated prog). The jited_line_info is available if the prog is jited. It is an array of __u64. If the prog is not jited, jited_line_info_cnt is 0.

The verifier's verbose log with line_info will be done in a follow-up patch.

Signed-off-by: Martin KaFai Lau <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
2018-12-08  tracing: Remove unneeded synth_event_mutex  (Masami Hiramatsu; 1 file, +7/-23)
Remove the unneeded synth_event_mutex. This mutex protects the reference count in synth_event; however, those operations are already protected by event_mutex:

1. In __create_synth_event() and create_or_delete_synth_event(), synth_event_mutex is clearly obtained right after event_mutex.

2. event_hist_trigger_func() is trigger_hist_cmd.func(), which is called by trigger_process_regex(), which is a part of event_trigger_regex_write(), and this function takes event_mutex.

3. hist_unreg_all() is trigger_hist_cmd.unreg_all(), which is called by event_trigger_regex_open(), and it takes event_mutex.

4. onmatch_destroy() and onmatch_create() have long call trees, but both are finally invoked from event_trigger_regex_write() and event_trace_del_tracer(); the former takes event_mutex, and the latter ensures it is called with event_mutex locked.

Finally, I ensured there is no resource conflict. For safety, I added lockdep_assert_held(&event_mutex) for each function.

Link: http://lkml.kernel.org/r/154140864134.17322.4796059721306031894.stgit@devbox
Reviewed-by: Tom Zanussi <[email protected]>
Tested-by: Tom Zanussi <[email protected]>
Signed-off-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-12-08  tracing: Use dyn_event framework for synthetic events  (Masami Hiramatsu; 3 files, +176/-98)
Use the dyn_event framework for synthetic events. This shows synthetic events in the "tracing/dynamic_events" file in addition to the tracing/synthetic_events interface.

Users can also define new events via tracing/dynamic_events with the "s:" prefix. So, the new syntax is:

  s:[synthetic/]EVENT_NAME TYPE ARG; [TYPE ARG;]...

To remove events via tracing/dynamic_events, you can use the "-:" prefix, the same as for other events.

Link: http://lkml.kernel.org/r/154140861301.17322.15454611233735614508.stgit@devbox
Reviewed-by: Tom Zanussi <[email protected]>
Tested-by: Tom Zanussi <[email protected]>
Signed-off-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-12-08  tracing/uprobes: Use dyn_event framework for uprobe events  (Masami Hiramatsu; 2 files, +149/-130)
Use the dyn_event framework for uprobe events. This shows uprobe events in the "dynamic_events" file. Users can also define new uprobe events via dynamic_events.

Link: http://lkml.kernel.org/r/154140858481.17322.9091293846515154065.stgit@devbox
Reviewed-by: Tom Zanussi <[email protected]>
Tested-by: Tom Zanussi <[email protected]>
Signed-off-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-12-08  tracing/kprobes: Use dyn_event framework for kprobe events  (Masami Hiramatsu; 4 files, +204/-145)
Use the dyn_event framework for kprobe events. This shows kprobe events in the "tracing/dynamic_events" file. Users can also define new events via tracing/dynamic_events.

Link: http://lkml.kernel.org/r/154140855646.17322.6619219995865980392.stgit@devbox
Reviewed-by: Tom Zanussi <[email protected]>
Tested-by: Tom Zanussi <[email protected]>
Signed-off-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-12-08  tracing: Add unified dynamic event framework  (Masami Hiramatsu; 5 files, +337/-0)
Add a unified dynamic event framework for ftrace kprobes, uprobes and synthetic events. Those dynamic events can coexist in the same file because their syntaxes don't overlap. This introduces the framework part, which provides a unified tracefs interface and operations.

Link: http://lkml.kernel.org/r/154140852824.17322.12250362185969352095.stgit@devbox
Reviewed-by: Tom Zanussi <[email protected]>
Tested-by: Tom Zanussi <[email protected]>
Signed-off-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-12-08  tracing: Integrate similar probe argument parsers  (Masami Hiramatsu; 4 files, +50/-96)
Integrate similar argument parsers for kprobes and uprobes events into traceprobe_parse_probe_arg().

Link: http://lkml.kernel.org/r/154140850016.17322.9836787731210512176.stgit@devbox
Reviewed-by: Tom Zanussi <[email protected]>
Tested-by: Tom Zanussi <[email protected]>
Signed-off-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-12-08  tracing: Simplify creation and deletion of synthetic events  (Masami Hiramatsu; 1 file, +18/-35)
Since the event_mutex and synth_event_mutex ordering issue is gone, we can skip the existing-event check when adding or deleting events, and remove some redundant code in the error path.

This changes release_all_synth_events() to abort the process when it hits any error and return the error code. It succeeds only if it encounters no error.

Link: http://lkml.kernel.org/r/154140847194.17322.17960275728005067803.stgit@devbox
Reviewed-by: Tom Zanussi <[email protected]>
Tested-by: Tom Zanussi <[email protected]>
Signed-off-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-12-08  tracing: Lock event_mutex before synth_event_mutex  (Masami Hiramatsu; 2 files, +38/-20)
The synthetic event code uses synth_event_mutex to protect synth_event_list, and the event_trigger_write() path acquires locks in the following order:

  event_trigger_write(event_mutex)
    -> trigger_process_regex(trigger_cmd_mutex)
      -> event_hist_trigger_func(synth_event_mutex)

On the other hand, the synthetic event creation and deletion paths call trace_add_event_call() and trace_remove_event_call(), which acquire event_mutex. In that case, if we keep synth_event_mutex locked while registering/unregistering synthetic events, the lock dependency will be inverted.

To avoid this issue, the current synthetic event code uses a two-phase process to create/delete events. For example, it searches existing events under synth_event_mutex to check for event-name conflicts and unlocks synth_event_mutex, then registers a new event with event_mutex locked. Finally, it locks synth_event_mutex again and tries to add the new event to the list. But this introduces complexity and a chance for name conflicts.

To solve this more simply, introduce trace_add_event_call_nolock() and trace_remove_event_call_nolock(), which don't acquire event_mutex inside. The synthetic event code can then lock event_mutex before synth_event_mutex to solve the lock dependency issue in a simpler way.

Link: http://lkml.kernel.org/r/154140844377.17322.13781091165954002713.stgit@devbox
Reviewed-by: Tom Zanussi <[email protected]>
Tested-by: Tom Zanussi <[email protected]>
Signed-off-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-12-08  tracing/uprobes: Add busy check when cleaning up all uprobes  (Masami Hiramatsu; 1 file, +7/-0)
Add a busy check loop in cleanup_all_probes() before trying to remove all events in uprobe_events, the same way that kprobe_events does.

Without this change, writing null to uprobe_events will try to remove events, but if one of them is enabled, it will stop there, leaving some events cleared and others not. With this change, writing null to uprobe_events makes sure all events are disabled before removing any. So it either clears all events, or returns an error (-EBUSY) while keeping all events.

Link: http://lkml.kernel.org/r/154140841557.17322.12653952888762532401.stgit@devbox
Reviewed-by: Tom Zanussi <[email protected]>
Tested-by: Tom Zanussi <[email protected]>
Signed-off-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-12-08  tracing: Change default buffer_percent to 50  (Steven Rostedt (VMware); 1 file, +1/-1)
After running several tests, it appears that having the reader wait till half the buffer is full before starting to read (and causing its own events to fill up the ring buffer constantly) works well. It keeps trace-cmd (the main user of this interface) from dominating the traces it records.

Signed-off-by: Steven Rostedt (VMware) <[email protected]>