path: root/include/linux/sched.h
Age | Commit message | Author | Files | Lines
2016-10-07 | mm,oom_reaper: do not attempt to reap a task twice | Tetsuo Handa | 1 | -1/+0
"mm, oom_reaper: do not attempt to reap a task more than twice" tried to give the OOM reaper one more chance to retry using the MMF_OOM_NOT_REAPABLE flag. But the usefulness of the flag is rather limited and has never actually been shown in practice. If the flag is set, it means that the holder of mm->mmap_sem cannot call up_write(), presumably because it is blocked in an unkillable wait for another thread's memory allocation. But since one of the threads sharing that mm will queue that mm immediately via the task_will_free_mem() shortcut (otherwise, oom_badness() will select the same mm again because the oom_score_adj value is unchanged), retrying an MMF_OOM_NOT_REAPABLE mm is unlikely to be helpful. Let's always set MMF_OOM_REAPED. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Tetsuo Handa <[email protected]> Signed-off-by: Michal Hocko <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: David Rientjes <[email protected]> Cc: Vladimir Davydov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-03 | Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 1 | -2/+79
Pull low-level x86 updates from Ingo Molnar:

"In this cycle this topic tree has become one of those 'super topics' that accumulated a lot of changes:

- Add CONFIG_VMAP_STACK=y support to the core kernel and enable it on x86 - preceded by an array of changes. v4.8 saw preparatory changes in this area already - this is the rest of the work. Includes the thread stack caching performance optimization. (Andy Lutomirski)
- switch_to() cleanups and all around enhancements. (Brian Gerst)
- A large number of dumpstack infrastructure enhancements and an unwinder abstraction. The secret long term plan is safe(r) live patching plus maybe another attempt at debuginfo based unwinding - but all these current bits are standalone enhancements in a frame pointer based debug environment as well. (Josh Poimboeuf)
- More __ro_after_init and const annotations. (Kees Cook)
- Enable KASLR for the vmemmap memory region. (Thomas Garnier)"

[ The virtually mapped stack changes are pretty fundamental, and not x86-specific per se, even if they are only used on x86 right now. ]

* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
  x86/asm: Get rid of __read_cr4_safe()
  thread_info: Use unsigned long for flags
  x86/alternatives: Add stack frame dependency to alternative_call_2()
  x86/dumpstack: Fix show_stack() task pointer regression
  x86/dumpstack: Remove dump_trace() and related callbacks
  x86/dumpstack: Convert show_trace_log_lvl() to use the new unwinder
  oprofile/x86: Convert x86_backtrace() to use the new unwinder
  x86/stacktrace: Convert save_stack_trace_*() to use the new unwinder
  perf/x86: Convert perf_callchain_kernel() to use the new unwinder
  x86/unwind: Add new unwind interface and implementations
  x86/dumpstack: Remove NULL task pointer convention
  fork: Optimize task creation by caching two thread stacks per CPU if CONFIG_VMAP_STACK=y
  sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK
  lib/syscall: Pin the task stack in collect_syscall()
  x86/process: Pin the target stack in get_wchan()
  x86/dumpstack: Pin the target stack when dumping it
  kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function
  sched/core: Add try_get_task_stack() and put_task_stack()
  x86/entry/64: Fix a minor comment rebase error
  iommu/amd: Don't put completion-wait semaphore on stack
  ...
2016-10-03 | Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 1 | -2/+28
Pull scheduler changes from Ingo Molnar:

"The main changes are:

- irqtime accounting cleanups and enhancements. (Frederic Weisbecker)
- schedstat debugging enhancements, make it more broadly runtime available. (Josh Poimboeuf)
- More work on asymmetric topology/capacity scheduling. (Morten Rasmussen)
- sched/wait fixes and cleanups. (Oleg Nesterov)
- PELT (per entity load tracking) improvements. (Peter Zijlstra)
- Rewrite and enhance select_idle_siblings(). (Peter Zijlstra)
- sched/numa enhancements/fixes (Rik van Riel)
- sched/cputime scalability improvements (Stanislaw Gruszka)
- Load calculation arithmetics fixes. (Dietmar Eggemann)
- sched/deadline enhancements (Tommaso Cucinotta)
- Fix utilization accounting when switching to the SCHED_NORMAL policy. (Vincent Guittot)
- ... plus misc cleanups and enhancements"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
  sched/irqtime: Consolidate irqtime flushing code
  sched/irqtime: Consolidate accounting synchronization with u64_stats API
  u64_stats: Introduce IRQs disabled helpers
  sched/irqtime: Remove needless IRQs disablement on kcpustat update
  sched/irqtime: No need for preempt-safe accessors
  sched/fair: Fix min_vruntime tracking
  sched/debug: Add SCHED_WARN_ON()
  sched/core: Fix set_user_nice()
  sched/fair: Introduce set_curr_task() helper
  sched/core, ia64: Rename set_curr_task()
  sched/core: Fix incorrect utilization accounting when switching to fair class
  sched/core: Optimize SCHED_SMT
  sched/core: Rewrite and improve select_idle_siblings()
  sched/core: Replace sd_busy/nr_busy_cpus with sched_domain_shared
  sched/core: Introduce 'struct sched_domain_shared'
  sched/core: Restructure destroy_sched_domain()
  sched/core: Remove unused @cpu argument from destroy_sched_domain*()
  sched/wait: Introduce init_wait_entry()
  sched/wait: Avoid abort_exclusive_wait() in __wait_on_bit_lock()
  sched/wait: Avoid abort_exclusive_wait() in ___wait_event()
  ...
2016-09-30 | sched/core, ia64: Rename set_curr_task() | Peter Zijlstra | 1 | -1/+1
Rename the ia64 only set_curr_task() function to free up the name. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tony Luck <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-09-30 | sched/core: Rewrite and improve select_idle_siblings() | Peter Zijlstra | 1 | -0/+3
select_idle_siblings() is a known pain point for a number of workloads; it either does too much or not enough and sometimes just does plain wrong. This rewrite attempts to address a number of issues (but sadly not all).

The current code does an unconditional sched_domain iteration, with the intent of finding an idle core (on SMT hardware). The problems which this patch tries to address are:

- it's pointless to look for idle cores if the machine is really busy; at which point you're just wasting cycles.
- its behaviour is inconsistent between SMT and !SMT hardware in that !SMT hardware ends up doing a scan for any idle CPU in the LLC domain, while SMT hardware does a scan for idle cores and if that fails, falls back to a scan for idle threads on the 'target' core.

The new code replaces the sched_domain scan with 3 explicit scans:

1) search for an idle core in the LLC
2) search for an idle CPU in the LLC
3) search for an idle thread in the 'target' core

where 1 and 3 are conditional on SMT support and 1 and 2 have runtime heuristics to skip the step.

Step 1) is conditional on sd_llc_shared->has_idle_cores; when a CPU goes idle and sd_llc_shared->has_idle_cores is false, we scan all SMT siblings of the CPU going idle. Similarly, we clear sd_llc_shared->has_idle_cores when we fail to find an idle core.

Step 2) tracks the average cost of the scan and compares this to the average idle time guesstimate for the CPU doing the wakeup. There is a significant fudge factor involved to deal with the variability of the averages. Esp. hackbench was sensitive to this.

Step 3) is unconditional; we assume (also per step 1) that scanning all SMT siblings in a core is 'cheap'.

With this; SMT systems gain step 2, which cures a few benchmarks -- notably one from Facebook.

One 'feature' of the sched_domain iteration, which we preserve in the new code, is that it would start scanning from the 'target' CPU, instead of scanning the cpumask in CPU id order. This avoids multiple CPUs in the LLC scanning for idle to gang up and find the same CPU quite as much. The down side is that tasks can end up hopping across the LLC for no apparent reason.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
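The wrap-around scan that starts at the 'target' CPU, described in the message above, can be illustrated with a small user-space sketch. This is not the kernel code: `NR_CPUS_DEMO`, `find_idle_cpu_from()` and the `idle[]` array are hypothetical names made up for illustration, and the runtime heuristics of steps 1 and 2 are left out.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_DEMO 8	/* hypothetical LLC size, for illustration only */

/*
 * Scan the CPUs of one LLC for an idle one, starting at 'target' and
 * wrapping around, instead of always walking in CPU id order.
 * Returns the first idle CPU found, or -1 if none is idle.
 */
static int find_idle_cpu_from(const bool idle[NR_CPUS_DEMO], int target)
{
	assert(target >= 0 && target < NR_CPUS_DEMO);

	for (int i = 0; i < NR_CPUS_DEMO; i++) {
		int cpu = (target + i) % NR_CPUS_DEMO;

		if (idle[cpu])
			return cpu;
	}
	return -1;
}
```

Because each waker starts from its own target, two wakers with different targets tend to find different idle CPUs, which is the "gang up" effect the message refers to.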
2016-09-30 | sched/core: Replace sd_busy/nr_busy_cpus with sched_domain_shared | Peter Zijlstra | 1 | -0/+1
Move the nr_busy_cpus thing from its hacky sd->parent->groups->sgc location into the much more natural sched_domain_shared location. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-09-30 | sched/core: Introduce 'struct sched_domain_shared' | Peter Zijlstra | 1 | -0/+6
Since struct sched_domain is strictly per cpu; introduce a structure that is shared between all 'identical' sched_domains. Limit to SD_SHARE_PKG_RESOURCES domains for now, as we'll only use it for shared cache state; if another use comes up later we can easily relax this. While the sched_group's are normally shared between CPUs, these are not natural to use when we need some shared state on a domain level -- since that would require the domain to have a parent, which is not a given. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-09-22 | sched/core: Avoid _cond_resched() for PREEMPT=y | Peter Zijlstra | 1 | -0/+4
On fully preemptible kernels _cond_resched() is pointless, so avoid emitting any code for it. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mikulas Patocka <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-09-22 | sched/core: Optimize __schedule() | Peter Zijlstra | 1 | -0/+2
Oleg noted that by making do_exit() use __schedule() for the TASK_DEAD context switch, we can avoid the TASK_DEAD special case currently in __schedule() because that avoids the extra preempt_disable() from schedule(). In order to facilitate this, create a do_task_dead() helper which we place in the scheduler code, such that it can access __schedule(). Also add some __noreturn annotations to the functions, there's no coming back from do_exit(). Suggested-by: Oleg Nesterov <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Cheng Chao <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-09-16 | sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK | Andy Lutomirski | 1 | -0/+14
We currently keep every task's stack around until the task_struct itself is freed. This means that we keep the stack allocation alive for longer than necessary and that, under load, we free stacks in big batches whenever RCU drops the last task reference. Neither of these is good for reuse of cache-hot memory, and freeing in batches prevents us from usefully caching small numbers of vmalloced stacks. On architectures that have thread_info on the stack, we can't easily change this, but on architectures that set THREAD_INFO_IN_TASK, we can free it as soon as the task is dead. Signed-off-by: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Jann Horn <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/08ca06cde00ebed0046c5d26cbbf3fbb7ef5b812.1474003868.git.luto@kernel.org Signed-off-by: Ingo Molnar <[email protected]>
2016-09-16 | sched/core: Add try_get_task_stack() and put_task_stack() | Andy Lutomirski | 1 | -0/+16
There are a few places in the kernel that access stack memory belonging to a different task. Before we can start freeing task stacks before the task_struct is freed, we need a way for those code paths to pin the stack. Signed-off-by: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Jann Horn <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/17a434f50ad3d77000104f21666575e10a9c1fbd.1474003868.git.luto@kernel.org Signed-off-by: Ingo Molnar <[email protected]>
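The idea of pinning another task's stack before touching it can be sketched in user space with a reference count. This is only an illustration under assumptions: `struct demo_stack_task`, `demo_try_get_stack()` and `demo_put_stack()` are hypothetical stand-ins, and the message above does not say how the kernel helpers are implemented.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a task; not the kernel's task_struct. */
struct demo_stack_task {
	void *stack;		/* NULL once the stack has been released */
	atomic_int stack_refs;	/* 0 means the stack is gone or going away */
};

/* Pin the stack: succeeds only while at least one reference is still held. */
static void *demo_try_get_stack(struct demo_stack_task *t)
{
	int old = atomic_load(&t->stack_refs);

	while (old > 0) {
		/* On failure 'old' is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(&t->stack_refs, &old, old + 1))
			return t->stack;
	}
	return NULL;
}

/* Unpin the stack; dropping the last reference releases it. */
static void demo_put_stack(struct demo_stack_task *t)
{
	int old = atomic_fetch_sub(&t->stack_refs, 1);

	assert(old > 0);
	if (old == 1)
		t->stack = NULL;	/* a real implementation would free it here */
}
```

A caller that gets NULL back must not touch the stack at all; a caller that gets a pointer must pair it with exactly one put.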
2016-09-15 | sched/core: Allow putting thread_info into task_struct | Andy Lutomirski | 1 | -2/+34
If an arch opts in by setting CONFIG_THREAD_INFO_IN_TASK_STRUCT, then thread_info is defined as a single 'u32 flags' and is the first entry of task_struct. thread_info::task is removed (it serves no purpose if thread_info is embedded in task_struct), and thread_info::cpu gets its own slot in task_struct. This is heavily based on a patch written by Linus. Originally-from: Linus Torvalds <[email protected]> Signed-off-by: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Jann Horn <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/a0898196f0476195ca02713691a5037a14f2aac5.1473801993.git.luto@kernel.org Signed-off-by: Ingo Molnar <[email protected]>
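The layout described above, a one-field thread_info sitting first in the task structure, can be sketched as follows. The `demo_` types are hypothetical and only mirror what the message states (a single 32-bit `flags` field, first member, with `cpu` moved to its own slot); the helper shows why "first entry" matters: a pointer to the task is also a valid pointer to its thread_info.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the reduced thread_info: a single flags word. */
struct demo_thread_info {
	uint32_t flags;
};

/* Hypothetical mirror of a task with thread_info embedded as first member. */
struct demo_ti_task {
	struct demo_thread_info thread_info;	/* must stay first */
	unsigned int cpu;			/* formerly thread_info::cpu */
};

/* With thread_info first, converting task -> thread_info is a plain cast. */
static struct demo_thread_info *demo_task_thread_info(struct demo_ti_task *t)
{
	assert(offsetof(struct demo_ti_task, thread_info) == 0);
	return (struct demo_thread_info *)t;
}
```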
2016-09-13 | cpufreq / sched: SCHED_CPUFREQ_IOWAIT flag to indicate iowait condition | Rafael J. Wysocki | 1 | -0/+1
Testing indicates that it is possible to improve performance significantly without increasing energy consumption too much by teaching cpufreq governors to bump up the CPU performance level if the in_iowait flag is set for the task in enqueue_task_fair(). For this purpose, define a new cpufreq_update_util() flag SCHED_CPUFREQ_IOWAIT and modify enqueue_task_fair() to pass that flag to cpufreq_update_util() in the in_iowait case. That generally requires cpufreq_update_util() to be called directly from there, because update_load_avg() may not be invoked in that case. Signed-off-by: Rafael J. Wysocki <[email protected]> Looks-good-to: Steve Muckle <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]>
2016-08-24 | fork: Add generic vmalloced stack support | Andy Lutomirski | 1 | -0/+15
If CONFIG_VMAP_STACK=y is selected, kernel stacks are allocated with __vmalloc_node_range(). Grsecurity has had a similar feature (called GRKERNSEC_KSTACKOVERFLOW=y) for a long time. Signed-off-by: Andy Lutomirski <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/14c07d4fd173a5b117f51e8b939f9f4323e39899.1470907718.git.luto@kernel.org [ Minor edits. ] Signed-off-by: Ingo Molnar <[email protected]>
2016-08-18 | sched/core: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag | Morten Rasmussen | 1 | -0/+1
Add a topology flag to the sched_domain hierarchy indicating the lowest domain level where the full range of CPU capacities is represented by the domain members for asymmetric capacity topologies (e.g. ARM big.LITTLE). The flag is intended to indicate that extra care should be taken when placing tasks on CPUs and this level spans all the different types of CPUs found in the system (no need to look further up the domain hierarchy). This information is currently only available through iterating through the capacities of all the CPUs at parent levels in the sched_domain hierarchy.

  SD 2      [  0      1      2      3]  SD_ASYM_CPUCAPACITY

  SD 1      [  0      1] [   2      3]  !SD_ASYM_CPUCAPACITY

  CPU:         0      1      2      3
  capacity:  756    756   1024   1024

If the topology in the example above is duplicated to create an eight CPU example with a third sched_domain level on top (SD 3), this level should not have the flag set (!SD_ASYM_CPUCAPACITY) as its two groups would both have all CPU capacities represented within them.

Signed-off-by: Morten Rasmussen <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
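The condition "the full range of CPU capacities is represented by the domain members" can be checked with a short user-space sketch. The function names and the flat `cap_of[]` array are hypothetical; the flag would go on the lowest domain level whose span passes this check. Note that the check is only meaningful on an asymmetric system: if every CPU has the same capacity, every span trivially passes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if capacity 'cap' occurs among CPUs [first, first + n). */
static bool span_has_capacity(const unsigned long *cap_of, size_t first,
			      size_t n, unsigned long cap)
{
	for (size_t i = first; i < first + n; i++)
		if (cap_of[i] == cap)
			return true;
	return false;
}

/*
 * True if every capacity value found anywhere in the system also occurs
 * inside the span [first, first + n), i.e. the span sees all CPU types.
 */
static bool span_covers_all_capacities(const unsigned long *cap_of,
				       size_t nr_cpus, size_t first, size_t n)
{
	assert(first + n <= nr_cpus);

	for (size_t cpu = 0; cpu < nr_cpus; cpu++)
		if (!span_has_capacity(cap_of, first, n, cap_of[cpu]))
			return false;
	return true;
}
```

With the four-CPU capacities from the example (756, 756, 1024, 1024), the two SD 1 spans fail the check and the SD 2 span passes. In the duplicated eight-CPU case the SD 2 spans already pass, so SD 3 is not the lowest such level.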
2016-08-16 | cpufreq / sched: Pass flags to cpufreq_update_util() | Rafael J. Wysocki | 1 | -4/+8
It is useful to know the reason why cpufreq_update_util() has just been called and that can be passed as flags to cpufreq_update_util() and to the ->func() callback in struct update_util_data. However, doing that in addition to passing the util and max arguments they already take would be clumsy, so avoid it. Instead, use the observation that the schedutil governor is part of the scheduler proper, so it can access scheduler data directly. This allows the util and max arguments of cpufreq_update_util() and the ->func() callback in struct update_util_data to be replaced with a flags one, but schedutil has to be modified to follow. Thus make the schedutil governor obtain the CFS utilization information from the scheduler and use the "RT" and "DL" flags instead of the special utilization value of ULONG_MAX to track updates from the RT and DL sched classes. Make it non-modular too to avoid having to export scheduler variables to modules at large. Next, update all of the other users of cpufreq_update_util() and the ->func() callback in struct update_util_data accordingly. Suggested-by: Peter Zijlstra <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Viresh Kumar <[email protected]>
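The shape of the change, a callback that receives a flags word describing why it was called instead of util and max values, can be sketched like this. Everything here is a hypothetical stand-in: the `demo_` names, the flag bit values and the exact callback signature are assumptions and are not taken from the kernel headers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits naming the reason for the update. */
#define DEMO_CPUFREQ_RT	(1u << 0)
#define DEMO_CPUFREQ_DL	(1u << 1)

/* Hypothetical stand-in for struct update_util_data. */
struct demo_update_util_data {
	void (*func)(struct demo_update_util_data *data, unsigned int flags);
	unsigned int last_flags;	/* what the governor saw last */
};

/* A toy governor callback: it only records the reason it was called. */
static void demo_governor_cb(struct demo_update_util_data *data,
			     unsigned int flags)
{
	data->last_flags = flags;
}

/* The scheduler side passes flags only; the governor fetches the rest. */
static void demo_cpufreq_update_util(struct demo_update_util_data *data,
				     unsigned int flags)
{
	assert(data != NULL);

	if (data->func)
		data->func(data, flags);
}
```

The design point from the message is that the governor reads utilization from scheduler data itself, so the call site only has to say why it is calling.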
2016-08-10 | sched/debug: Make the "Preemption disabled at ..." message more useful | Vegard Nossum | 1 | -0/+9
This message is currently really useless since it always prints a value that comes from the printk() we just did, e.g.:

  BUG: sleeping function called from invalid context at mm/slab.h:388
  in_atomic(): 0, irqs_disabled(): 0, pid: 31996, name: trinity-c1
  Preemption disabled at:[<ffffffff8119db33>] down_trylock+0x13/0x80

  BUG: sleeping function called from invalid context at include/linux/freezer.h:56
  in_atomic(): 0, irqs_disabled(): 0, pid: 31996, name: trinity-c1
  Preemption disabled at:[<ffffffff811aaa37>] console_unlock+0x2f7/0x930

Here, both down_trylock() and console_unlock() are somewhere in the printk() path. We should save the value before calling printk() and use the saved value instead. That immediately reveals the offending callsite:

  BUG: sleeping function called from invalid context at mm/slab.h:388
  in_atomic(): 0, irqs_disabled(): 0, pid: 14971, name: trinity-c2
  Preemption disabled at:[<ffffffff819bcd46>] rhashtable_walk_start+0x46/0x150

Bug report: http://marc.info/?l=linux-netdev&m=146925979821849&w=2

Signed-off-by: Vegard Nossum <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rusty Russel <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
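The bug pattern, a record that the reporting path itself overwrites before it is read, can be reproduced in a few lines of user-space C. All names and the two address constants are hypothetical; `demo_log()` merely imitates a logging call that disables preemption internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for "where preemption was last disabled". */
static unsigned long demo_last_disable_ip;

static void demo_preempt_disable(unsigned long ip)
{
	demo_last_disable_ip = ip;
}

/* Logging disables preemption on its own, overwriting the record. */
static void demo_log(const char *msg)
{
	assert(msg != NULL);
	demo_preempt_disable(0x2222);	/* made-up address inside the log path */
	fputs(msg, stderr);
}

/* Buggy order: log first, read the record afterwards. */
static unsigned long demo_report_buggy(void)
{
	demo_log("BUG: sleeping function called from invalid context\n");
	return demo_last_disable_ip;	/* already clobbered by demo_log() */
}

/* Fixed order: save the record, then log, then use the saved value. */
static unsigned long demo_report_fixed(void)
{
	unsigned long saved = demo_last_disable_ip;

	demo_log("BUG: sleeping function called from invalid context\n");
	return saved;
}
```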
2016-08-10 | sched/core: Fix power to capacity renaming in comment | Morten Rasmussen | 1 | -1/+1
It seems that this one escaped Nico's renaming of cpu_power to cpu_capacity a while back. Signed-off-by: Morten Rasmussen <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-08-02 | signal: consolidate {TS,TLF}_RESTORE_SIGMASK code | Andy Lutomirski | 1 | -0/+63
In general, there's no need for the "restore sigmask" flag to live in ti->flags. alpha, ia64, microblaze, powerpc, sh, sparc (64-bit only), tile, and x86 use essentially identical alternative implementations, placing the flag in ti->status. Replace those optimized implementations with an equally good common implementation that stores it in a bitfield in struct task_struct and drop the custom implementations. Additional architectures can opt in by removing their TIF_RESTORE_SIGMASK defines. Link: http://lkml.kernel.org/r/8a14321d64a28e40adfddc90e18a96c086a6d6f9.1468522723.git.luto@kernel.org Signed-off-by: Andy Lutomirski <[email protected]> Tested-by: Michael Ellerman <[email protected]> [powerpc] Cc: Richard Henderson <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: Matt Turner <[email protected]> Cc: Tony Luck <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Michal Simek <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Yoshinori Sato <[email protected]> Cc: Rich Felker <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Dmitry Safonov <[email protected]> Cc: Oleg Nesterov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
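Storing the "restore sigmask" state as a one-bit field in the task structure can be sketched as below. The struct and helper names are hypothetical `demo_` stand-ins chosen for illustration; they are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task with the flag kept in a bitfield, not in ti->flags. */
struct demo_sig_task {
	unsigned restore_sigmask:1;
};

/* Mark that the saved signal mask must be restored on return to user. */
static void demo_set_restore_sigmask(struct demo_sig_task *t)
{
	assert(t != NULL);
	t->restore_sigmask = 1;
}

/* Report whether the flag was set, and clear it in the same step. */
static bool demo_test_and_clear_restore_sigmask(struct demo_sig_task *t)
{
	assert(t != NULL);
	if (!t->restore_sigmask)
		return false;
	t->restore_sigmask = 0;
	return true;
}
```

Only the owning task touches this bit in the sketch, which is why a plain, non-atomic bitfield is enough here.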
2016-07-28 | mm, oom_reaper: do not attempt to reap a task more than twice | Michal Hocko | 1 | -0/+1
oom_reaper relies on the mmap_sem for read to do its job. Many places which might block readers have been converted to use down_write_killable and that has reduced chances of the contention a lot. Some paths where the mmap_sem is held for write can take other locks and they might either be not prepared to fail due to fatal signal pending or too impractical to be changed. This patch introduces the MMF_OOM_NOT_REAPABLE flag which gets set after the first attempt to reap a task's mm fails. If the flag is present after the failure then we set MMF_OOM_REAPED to hide this mm from the oom killer completely so it can go and choose another victim. As a result, the risk of an OOM deadlock, where the oom victim is blocked indefinitely and so the oom killer cannot make any progress, should be mitigated considerably while we still try really hard to perform all reclaim attempts and stay predictable in the behavior. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: David Rientjes <[email protected]> Cc: Tetsuo Handa <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
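The two-strike handling of a failed reap attempt described above reduces to a small state transition on the mm flags. A minimal sketch, assuming made-up bit values and a hypothetical helper name (the real flag numbers are not given in the message):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit values standing in for the real mm flags. */
#define DEMO_OOM_REAPED		(1u << 0)
#define DEMO_OOM_NOT_REAPABLE	(1u << 1)

/*
 * Called after an attempt to reap an mm has failed.
 * First failure: remember it, so the reaper gets one more chance.
 * Second failure: mark the mm as reaped to hide it from the OOM killer.
 */
static void demo_reap_failed(unsigned int *mm_flags)
{
	assert(mm_flags != NULL);

	if (*mm_flags & DEMO_OOM_NOT_REAPABLE)
		*mm_flags |= DEMO_OOM_REAPED;
	else
		*mm_flags |= DEMO_OOM_NOT_REAPABLE;
}
```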
2016-07-28 | mm, oom: skip vforked tasks from being selected | Michal Hocko | 1 | -0/+26
vforked tasks are not really sitting on any memory. They are sharing the mm with parent until they exec into a new code. Until then it is just pinning the address space. OOM killer will kill the vforked task along with its parent but we still can end up selecting vforked task when the parent wouldn't be selected. E.g. init doing vfork to launch a task or vforked being a child of oom unkillable task with an updated oom_score_adj to be killable. Add a new helper to check whether a task is in the vfork sharing memory with its parent and use it in oom_badness to skip over these tasks. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: David Rientjes <[email protected]> Cc: Tetsuo Handa <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
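The check "is this task still in vfork and sharing memory with its parent" can be sketched as a predicate over a simplified task. The struct layout and field names below are hypothetical, and the locking a real implementation would need while looking at the parent is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified task; not the kernel's task_struct. */
struct demo_vfork_task {
	const void *mm;					/* address space in use */
	const struct demo_vfork_task *real_parent;
	bool vfork_pending;	/* parent still waits for exec or exit */
};

/*
 * True while the child has not exec'ed yet and still runs on the
 * parent's address space, so it pins that mm but owns no memory itself.
 */
static bool demo_in_vfork(const struct demo_vfork_task *t)
{
	assert(t != NULL && t->real_parent != NULL);

	return t->vfork_pending && t->real_parent->mm == t->mm;
}
```

A badness calculation could then skip any task for which this predicate holds, as the message describes for oom_badness.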
2016-06-27 | sched/fair: Fix PELT integrity for new tasks | Peter Zijlstra | 1 | -2/+3
Vincent and Yuyang found another few scenarios in which entity tracking goes wobbly. The scenarios are basically due to the fact that new tasks are not immediately attached and thereby differ from the normal situation -- a task is always attached to a cfs_rq load average (such that it includes its blocked contribution) and is explicitly detached/attached on migration to another cfs_rq.

Scenario 1: switch to fair class

  p->sched_class = fair_class;
  if (queued)
    enqueue_task(p);
      ...
        enqueue_entity()
          enqueue_entity_load_avg()
            migrated = !sa->last_update_time (true)
            if (migrated)
              attach_entity_load_avg()
  check_class_changed()
    switched_from() (!fair)
    switched_to()   (fair)
      switched_to_fair()
        attach_entity_load_avg()

If @p is a new task that hasn't been fair before, it will have !last_update_time and, per the above, end up in attach_entity_load_avg() _twice_.

Scenario 2: change between cgroups

  sched_move_group(p)
    if (queued)
      dequeue_task()
    task_move_group_fair()
      detach_task_cfs_rq()
        detach_entity_load_avg()
      set_task_rq()
      attach_task_cfs_rq()
        attach_entity_load_avg()
    if (queued)
      enqueue_task();
        ...
          enqueue_entity()
            enqueue_entity_load_avg()
              migrated = !sa->last_update_time (true)
              if (migrated)
                attach_entity_load_avg()

Similar to scenario 1, if @p is a new task, it will have !last_update_time and we'll end up in attach_entity_load_avg() _twice_. Furthermore, notice how we do a detach_entity_load_avg() on something that wasn't attached to begin with.

As stated above; the problem is that the new task isn't yet attached to the load tracking and thereby violates the invariant assumption. This patch remedies this by ensuring a new task is indeed properly attached to the load tracking on creation, through post_init_entity_util_avg().

Of course, this isn't entirely as straightforward as one might think, since the task is hashed before we call wake_up_new_task() and thus can be poked at. We avoid this by adding TASK_NEW and teaching cpu_cgroup_can_attach() to refuse such tasks.

Reported-by: Yuyang Du <[email protected]> Reported-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-06-27 | Merge branch 'sched/urgent' into sched/core, to pick up fixes | Ingo Molnar | 1 | -1/+1
Signed-off-by: Ingo Molnar <[email protected]>
2016-06-24 | Clarify naming of thread info/stack allocators | Linus Torvalds | 1 | -1/+1
We've had the thread info allocated together with the thread stack for most architectures for a long time (since the thread_info was split off from the task struct), but that is about to change.

But the patches that move the thread info to be off-stack (and a part of the task struct instead) made it clear how confused the allocator and freeing functions are.

Because the common case was that we share an allocation with the thread stack and the thread_info, the two pointers were identical. That identity then meant that we would have things like

	ti = alloc_thread_info_node(tsk, node);
	...
	tsk->stack = ti;

which certainly _worked_ (since stack and thread_info have the same value), but is rather confusing: why are we assigning a thread_info to the stack? And if we move the thread_info away, the "confusing" code just gets to be entirely bogus.

So remove all this confusion, and make it clear that we are doing the stack allocation by renaming and clarifying the function names to be about the stack. The fact that the thread_info then shares the allocation is an implementation detail, and not really about the allocation itself.

This is a pure renaming and type fix: we pass in the same pointer, it's just that we clarify what the pointer means.

The ia64 code that actually only has one single allocation (for all of task_struct, thread_info and kernel thread stack) now looks a bit odd, but since "tsk->stack" is actually not even used there, that oddity doesn't matter. It would be a separate thing to clean that up, I intentionally left the ia64 changes as a pure brute-force renaming and type change.

Acked-by: Andy Lutomirski <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-06-03 | sched/api: Introduce task_rcu_dereference() and try_get_task_struct() | Oleg Nesterov | 1 | -0/+3
Generally task_struct is only protected by RCU if it was found on a RCU protected list (say, for_each_process() or find_task_by_vpid()). As Kirill pointed out rq->curr isn't protected by RCU, the scheduler drops the (potentially) last reference without RCU gp, this means that we need to fix the code which uses foreign_rq->curr under rcu_read_lock(). Add a new helper which can be used to dereference rq->curr or any other pointer to task_struct assuming that it should be cleared or updated before the final put_task_struct(). It returns non-NULL only if this task can't go away before rcu_read_unlock(). ( Also add try_get_task_struct() to make it easier to use this API correctly. ) Suggested-by: Kirill Tkhai <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> [ Updated comments; added try_get_task_struct()] Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vladimir Davydov <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-05-26 | mm: oom_reaper: remove some bloat | Michal Hocko | 1 | -1/+3
mmput_async is currently used only from the oom_reaper which is defined only for CONFIG_MMU. We can save work_struct in mm_struct for !CONFIG_MMU. [[email protected]: fix typo, per Minchan] Link: http://lkml.kernel.org/r/[email protected] Reported-by: Minchan Kim <[email protected]> Signed-off-by: Michal Hocko <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: Tetsuo Handa <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-05-25 | Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 1 | -0/+1
Pull scheduler fixes from Ingo Molnar:

"Two fixes: one for a lost wakeup, the other to fix the compiler optimizing out preempt operations on ARM64 (and possibly other non-x86 architectures)"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Fix remote wakeups
  sched/preempt: Fix preempt_count manipulations
2016-05-25sched/core: Fix remote wakeupsPeter Zijlstra1-0/+1
Commit: b5179ac70de8 ("sched/fair: Prepare to fix fairness problems on migration") ... introduced a bug: Mike Galbraith found that it introduced a performance regression, while Paul E. McKenney reported lost wakeups and bisected it to this commit. The reason is that I mis-read ttwu_queue() such that I assumed any wakeup that got a remote queue must have had the task migrated. Since this is not so; we need to transfer this information between queueing the wakeup and actually doing the wakeup. Use a new task_struct::sched_flag for this, we already write to sched_contributes_to_load in the wakeup path so this is a hot and modified cacheline. Reported-by: Paul E. McKenney <[email protected]> Reported-by: Mike Galbraith <[email protected]> Tested-by: Mike Galbraith <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Hunter <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Ben Segall <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Matt Fleming <[email protected]> Cc: Morten Rasmussen <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Paul Turner <[email protected]> Cc: Pavan Kondeti <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Quentin Casasnovas <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Fixes: b5179ac70de8 ("sched/fair: Prepare to fix fairness problems on migration") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-05-23signal: make oom_flags a boolTetsuo Handa1-1/+5
Currently the size of "struct signal_struct"->oom_flags member is sizeof(unsigned) bytes, but only one flag OOM_FLAG_ORIGIN which is updated by current thread is defined. We can convert OOM_FLAG_ORIGIN into a bool, and reuse the saved bytes for updating from the OOM killer and/or the OOM reaper thread. By the way, do we care about a race window between run_store() and swapoff() because it would be theoretically possible that two threads sharing the "struct signal_struct" concurrently call respective functions? If we care, we can make oom_flags an atomic_t. Signed-off-by: Tetsuo Handa <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-05-20Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-2/+21
Merge more updates from Andrew Morton: - the rest of MM - KASAN updates - procfs updates - exit, fork updates - printk updates - lib/ updates - radix-tree testsuite updates - checkpatch updates - kprobes updates - a few other misc bits * emailed patches from Andrew Morton <[email protected]>: (162 commits) samples/kprobes: print out the symbol name for the hooks samples/kprobes: add a new module parameter kprobes: add the "tls" argument for j_do_fork init/main.c: simplify initcall_blacklisted() fs/efs/super.c: fix return value checkpatch: improve --git <commit-count> shortcut checkpatch: reduce number of `git log` calls with --git checkpatch: add support to check already applied git commits checkpatch: add --list-types to show message types to show or ignore checkpatch: advertise the --fix and --fix-inplace options more checkpatch: whine about ACCESS_ONCE checkpatch: add test for keywords not starting on tabstops checkpatch: improve CONSTANT_COMPARISON test for structure members checkpatch: add PREFER_IS_ENABLED test lib/GCD.c: use binary GCD algorithm instead of Euclidean radix-tree: free up the bottom bit of exceptional entries for reuse dax: move RADIX_DAX_ definitions to dax.c radix-tree: make radix_tree_descend() more useful radix-tree: introduce radix_tree_replace_clear_tags() radix-tree: tidy up __radix_tree_create() ...
2016-05-20Merge tag 'staging-4.7-rc1' of ↵Linus Torvalds1-0/+4
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging Pull staging and IIO driver updates from Greg KH: "Here's the big staging and iio driver update for 4.7-rc1. I think we almost broke even with this release, only adding a few more lines than we removed, which isn't bad overall given that there's a bunch of new iio drivers added. The Lustre developers seem to have woken up from their sleep and have been doing a great job in cleaning up the code and pruning unused or old cruft, the filesystem is almost readable :) Other than that, just a lot of basic coding style cleanups in the churn. All have been in linux-next for a while with no reported issues" * tag 'staging-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (938 commits) Staging: emxx_udc: emxx_udc: fixed coding style issue staging/gdm724x: fix "alignment should match open parenthesis" issues staging/gdm724x: Fix avoid CamelCase staging: unisys: rename misleading var ii with frag staging: unisys: visorhba: switch success handling to error handling staging: unisys: visorhba: main path needs to flow down the left margin staging: unisys: visorinput: handle_locking_key() simplifications staging: unisys: visorhba: fail gracefully for thread creation failures staging: unisys: visornic: comment restructuring and removing bad diction staging: unisys: fix format string %Lx to %llx for u64 staging: unisys: remove unused struct members staging: unisys: visorchannel: correct variable misspelling staging: unisys: visorhba: replace functionlike macro with function staging: dgnc: Need to check for NULL of ch staging: dgnc: remove redundant condition check staging: dgnc: fix 'line over 80 characters' staging: dgnc: clean up the dgnc_get_modem_info() staging: lustre: lnet: enable configuration per NI interface staging: lustre: o2iblnd: properly set ibr_why staging: lustre: o2iblnd: remove last of kiblnd_tunables_fini ...
2016-05-20exit_thread: accept a task parameter to be exitedJiri Slaby1-2/+2
We need to call exit_thread from copy_process in a fail path. So make it accept task_struct as a parameter. [v2] * s390: exit_thread_runtime_instr doesn't make sense to be called for non-current tasks. * arm: fix the comment in vfp_thread_copy * change 'me' to 'tsk' for task_struct * now we can change only archs that actually have exit_thread [[email protected]: coding-style fixes] Signed-off-by: Jiri Slaby <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Aurelien Jacquiot <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chen Liqin <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Chris Zankel <[email protected]> Cc: David Howells <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Guan Xuetao <[email protected]> Cc: Haavard Skinnemoen <[email protected]> Cc: Hans-Christian Egtvedt <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: James Hogan <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Jesper Nilsson <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Jonas Bonn <[email protected]> Cc: Koichi Yasutake <[email protected]> Cc: Lennox Wu <[email protected]> Cc: Ley Foon Tan <[email protected]> Cc: Mark Salter <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Simek <[email protected]> Cc: Mikael Starvik <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Rich Felker <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Richard Kuo <[email protected]> Cc: Richard 
Weinberger <[email protected]> Cc: Russell King <[email protected]> Cc: Steven Miao <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tony Luck <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-05-20exit_thread: remove empty bodiesJiri Slaby1-0/+7
Define HAVE_EXIT_THREAD for archs which want to do something in exit_thread. For others, let's define exit_thread as an empty inline. This is a cleanup before we change the prototype of exit_thread to accept a task parameter. [[email protected]: fix mips] Signed-off-by: Jiri Slaby <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Aurelien Jacquiot <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chen Liqin <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Chris Zankel <[email protected]> Cc: David Howells <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Guan Xuetao <[email protected]> Cc: Haavard Skinnemoen <[email protected]> Cc: Hans-Christian Egtvedt <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: James Hogan <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Jesper Nilsson <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Jonas Bonn <[email protected]> Cc: Koichi Yasutake <[email protected]> Cc: Lennox Wu <[email protected]> Cc: Ley Foon Tan <[email protected]> Cc: Mark Salter <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Simek <[email protected]> Cc: Mikael Starvik <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Rich Felker <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Richard Kuo <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Russell King <[email protected]> Cc: Steven Miao <[email protected]> Cc: Thomas Gleixner <[email 
protected]> Cc: Tony Luck <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-05-20userfaultfd: don't pin the user memory in userfaultfd_file_create()Oleg Nesterov1-1/+6
userfaultfd_file_create() increments mm->mm_users; this means that the memory won't be unmapped/freed if mm owner exits/execs, and UFFDIO_COPY after that can populate the orphaned mm more. Change userfaultfd_file_create() and userfaultfd_ctx_put() to use mm->mm_count to pin mm_struct. This means that atomic_inc_not_zero(mm->mm_users) is needed when we are going to actually play with this memory. Except handle_userfault() path doesn't need this, the caller must already have a reference. The patch adds the new trivial helper, mmget_not_zero(), it can have more users. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Oleg Nesterov <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
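The "take a reference only if the count is still non-zero" step described above can be sketched in userspace with C11 atomics. This is an illustration of the idea only, not the kernel's code: `struct fake_mm` and `fake_mmget_not_zero()` are hypothetical names, and only the user count is modeled.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for mm_struct; only the user count is modeled. */
struct fake_mm {
	atomic_int mm_users;
};

/*
 * Take a reference only if the count is still non-zero.
 * Returns true on success, false if the count had already reached zero
 * (meaning the address space is being torn down and must not be used).
 */
static bool fake_mmget_not_zero(struct fake_mm *mm)
{
	int old = atomic_load(&mm->mm_users);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&mm->mm_users, &old, old + 1))
			return true;
	}
	return false;
}
```

A caller that holds only a structure-level pin would call this before touching the memory, and back off if it returns false.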
2016-05-20mm,oom: speed up select_bad_process() loopTetsuo Handa1-0/+1
Since commit 3a5dda7a17cf ("oom: prevent unnecessary oom kills or kernel panics"), select_bad_process() is using for_each_process_thread(). Since oom_unkillable_task() scans all threads in the caller's thread group and oom_task_origin() scans signal_struct of the caller's thread group, we don't need to call oom_unkillable_task() and oom_task_origin() on each thread. Also, since !mm test will be done later at oom_badness(), we don't need to do !mm test on each thread. Therefore, we only need to do TIF_MEMDIE test on each thread. Although the original code was correct it was quite inefficient because each thread group was scanned num_threads times which can be a lot especially with processes with many threads. Even though the OOM is extremely cold path it is always good to be as effective as possible when we are inside rcu_read_lock() - aka unpreemptible context. If we track number of TIF_MEMDIE threads inside signal_struct, we don't need to do TIF_MEMDIE test on each thread. This will allow select_bad_process() to use for_each_process(). This patch adds a counter to signal_struct for tracking how many TIF_MEMDIE threads are in a given thread group, and check it at oom_scan_process_thread() so that select_bad_process() can use for_each_process() rather than for_each_process_thread(). [[email protected]: do not blow the signal_struct size] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Tetsuo Handa <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: David Rientjes <[email protected]> Cc: Oleg Nesterov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
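The trade-off in this message, a per-group counter instead of a per-thread scan, can be modeled in a few lines. The following is a rough userspace sketch, not the kernel's data structures: `model_thread`, `model_group`, `memdie_count` and the helper names are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a thread carrying a "memdie" marker. */
struct model_thread {
	bool memdie;
};

/* Hypothetical model of a thread group with a per-group counter. */
struct model_group {
	struct model_thread *threads;
	size_t nr_threads;
	int memdie_count;	/* number of threads with memdie set */
};

/* Old approach: visit every thread of the group, O(nr_threads). */
static bool group_has_victim_scan(const struct model_group *g)
{
	for (size_t i = 0; i < g->nr_threads; i++)
		if (g->threads[i].memdie)
			return true;
	return false;
}

/* New approach: consult the counter once per group, O(1). */
static bool group_has_victim_counter(const struct model_group *g)
{
	return g->memdie_count > 0;
}

/* Setting the marker keeps the counter in sync; repeated marks are ignored. */
static void mark_memdie(struct model_group *g, size_t i)
{
	if (!g->threads[i].memdie) {
		g->threads[i].memdie = true;
		g->memdie_count++;
	}
}
```

With the counter kept in sync at the point where the marker is set, a walk over groups no longer needs to descend into each group's threads.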
2016-05-20mm, oom_reaper: do not mmput synchronously from the oom reaper contextMichal Hocko1-0/+5
Tetsuo has properly noted that mmput slow path might get blocked waiting for another party (e.g. exit_aio waits for an IO). If that happens the oom_reaper would be put out of the way and will not be able to process next oom victim. We should strive for making this context as reliable and independent on other subsystems as much as possible. Introduce mmput_async which will perform the slow path from an async (WQ) context. This will delay the operation but that shouldn't be a problem because the oom_reaper has reclaimed the victim's address space for most cases as much as possible and the remaining context shouldn't bind too much memory anymore. The only exception is when mmap_sem trylock has failed which shouldn't happen too often. The issue is only theoretical but not impossible. Signed-off-by: Michal Hocko <[email protected]> Reported-by: Tetsuo Handa <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-05-20mm, oom_reaper: hide oom reaped tasks from OOM killer more carefullyMichal Hocko1-0/+1
Commit 36324a990cf5 ("oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space") not only clears TIF_MEMDIE for oom reaped task but also set OOM_SCORE_ADJ_MIN for the target task to hide it from the oom killer. This works in simple cases but it is not sufficient for (unlikely) cases where the mm is shared between independent processes (as they do not share signal struct). If the mm had only small amount of memory which could be reaped then another task sharing the mm could be selected and that wouldn't help to move out from the oom situation. Introduce MMF_OOM_REAPED mm flag which is checked in oom_badness (same as OOM_SCORE_ADJ_MIN) and task is skipped if the flag is set. Set the flag after __oom_reap_task is done with a task. This will force the select_bad_process() to ignore all already oom reaped tasks as well as no such task is sacrificed for its parent. Signed-off-by: Michal Hocko <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-05-16Merge tag 'pm-4.7-rc1' of ↵Linus Torvalds1-1/+4
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "The majority of changes go into the cpufreq subsystem this time. To me, quite obviously, the biggest ticket item is the new "schedutil" governor. Interestingly enough, it's the first new cpufreq governor since the beginning of the git era (except for some out-of-the-tree ones). There are two main differences between it and the existing governors. First, it uses the information provided by the scheduler directly for making its decisions, so it doesn't have to track anything by itself. Second, it can invoke drivers (supporting that feature) to adjust CPU performance right away without having to spawn work items to be executed in process context or similar. Currently, the acpi-cpufreq driver is the only one supporting that mode of operation, but then it is used on a large number of systems. The "schedutil" governor as included here is very simple and mostly regarded as a foundation for future work on the integration of the scheduler with CPU power management (in fact, there is work in progress on top of it already). Nevertheless it works and the preliminary results obtained with it are encouraging. There also is some consolidation of CPU frequency management for ARM platforms that can add their machine IDs to the new stub dt-platdev driver now and that will take care of creating the requisite platform device for cpufreq-dt, so it is not necessary to do that in platform code any more. Several ARM platforms are switched over to using this generic mechanism. In addition to that, the intel_pstate driver is now going to respect CPU frequency limits set by the platform firmware (or a BMC) and provided via the ACPI _PPC object. The devfreq subsystem is getting a new "passive" governor for SoC subsystems that will depend on somebody else to manage their voltage rails and its support for Samsung Exynos SoCs is consolidated. 
The rest is support for new hardware (Intel Broxton support in intel_idle for one example), bug fixes, optimizations and cleanups in a number of places. Specifics: - New cpufreq "schedutil" governor (making decisions based on CPU utilization information provided by the scheduler and capable of switching CPU frequencies right away if the underlying driver supports that) and support for fast frequency switching in the acpi-cpufreq driver (Rafael Wysocki) - Consolidation of CPU frequency management on ARM platforms allowing them to get rid of some platform-specific boilerplate code if they are going to use the cpufreq-dt driver (Viresh Kumar, Finley Xiao, Marc Gonzalez) - Support for ACPI _PPC and CPU frequency limits in the intel_pstate driver (Srinivas Pandruvada) - Fixes and cleanups in the cpufreq core and generic governor code (Rafael Wysocki, Sai Gurrappadi) - intel_pstate driver optimizations and cleanups (Rafael Wysocki, Philippe Longepe, Chen Yu, Joe Perches) - cpufreq powernv driver fixes and cleanups (Akshay Adiga, Shilpasri Bhat) - cpufreq qoriq driver fixes and cleanups (Jia Hongtao) - ACPI cpufreq driver cleanups (Viresh Kumar) - Assorted cpufreq driver updates (Ashwin Chaugule, Geliang Tang, Javier Martinez Canillas, Paul Gortmaker, Sudeep Holla) - Assorted cpufreq fixes and cleanups (Joe Perches, Arnd Bergmann) - Fixes and cleanups in the OPP (Operating Performance Points) framework, mostly related to OPP sharing, and reorganization of OF-dependent code in it (Viresh Kumar, Arnd Bergmann, Sudeep Holla) - New "passive" governor for devfreq (for SoC subsystems that will rely on someone else for the management of their power resources) and consolidation of devfreq support for Exynos platforms, coding style and typo fixes for devfreq (Chanwoo Choi, MyungJoo Ham) - PM core fixes and cleanups, mostly to make it work better with the generic power domains (genpd) framework, and updates for that framework (Ulf Hansson, Thierry Reding, Colin Ian King) - Intel 
Broxton support for the intel_idle driver (Len Brown) - cpuidle core optimization and fix (Daniel Lezcano, Dave Gerlach) - ARM cpuidle cleanups (Jisheng Zhang) - Intel Kabylake support for the RAPL power capping driver (Jacob Pan) - AVS (Adaptive Voltage Switching) rockchip-io driver update (Heiko Stuebner) - Updates for the cpupower tool (Arjun Sreedharan, Colin Ian King, Mattia Dongili, Thomas Renninger)" * tag 'pm-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (112 commits) intel_pstate: Clean up get_target_pstate_use_performance() intel_pstate: Use sample.core_avg_perf in get_avg_pstate() intel_pstate: Clarify average performance computation intel_pstate: Avoid unnecessary synchronize_sched() during initialization cpufreq: schedutil: Make default depend on CONFIG_SMP cpufreq: powernv: del_timer_sync when global and local pstate are equal cpufreq: powernv: Move smp_call_function_any() out of irq safe block intel_pstate: Clean up intel_pstate_get() cpufreq: schedutil: Make it depend on CONFIG_SMP cpufreq: governor: Fix handling of special cases in dbs_update() PM / OPP: Move CONFIG_OF dependent code in a separate file cpufreq: intel_pstate: Ignore _PPC processing under HWP cpufreq: arm_big_little: use generic OPP functions for {init, free}_opp_table PM / OPP: add non-OF versions of dev_pm_opp_{cpumask_, }remove_table cpufreq: tango: Use generic platdev driver PM / OPP: pass cpumask by reference cpufreq: Fix GOV_LIMITS handling for the userspace governor cpupower: fix potential memory leak PM / devfreq: style/typo fixes PM / devfreq: exynos: Add the detailed correlation for Exynos5422 bus ..
2016-05-16Merge branch 'sched-core-for-linus' of ↵Linus Torvalds1-16/+108
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - massive CPU hotplug rework (Thomas Gleixner) - improve migration fairness (Peter Zijlstra) - CPU load calculation updates/cleanups (Yuyang Du) - cpufreq updates (Steve Muckle) - nohz optimizations (Frederic Weisbecker) - switch_mm() micro-optimization on x86 (Andy Lutomirski) - ... lots of other enhancements, fixes and cleanups. * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (66 commits) ARM: Hide finish_arch_post_lock_switch() from modules sched/core: Provide a tsk_nr_cpus_allowed() helper sched/core: Use tsk_cpus_allowed() instead of accessing ->cpus_allowed sched/loadavg: Fix loadavg artifacts on fully idle and on fully loaded systems sched/fair: Correct unit of load_above_capacity sched/fair: Clean up scale confusion sched/nohz: Fix affine unpinned timers mess sched/fair: Fix fairness issue on migration sched/core: Kill sched_class::task_waking to clean up the migration logic sched/fair: Prepare to fix fairness problems on migration sched/fair: Move record_wakee() sched/core: Fix comment typo in wake_q_add() sched/core: Remove unused variable sched: Make hrtick_notifier an explicit call sched/fair: Make ilb_notifier an explicit call sched/hotplug: Make activate() the last hotplug step sched/hotplug: Move migration CPU_DYING to sched_cpu_dying() sched/migration: Move CPU_ONLINE into scheduler state sched/migration: Move calc_load_migrate() into CPU_DYING sched/migration: Move prepare transition to SCHED_STARTING state ...
2016-05-16Merge branch 'core-signals-for-linus' of ↵Linus Torvalds1-0/+20
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core signal updates from Ingo Molnar: "These updates from Stas Sergeev and Andy Lutomirski, improve the sigaltstack interface by extending its ABI with the SS_AUTODISARM feature, which makes it possible to use swapcontext() in a sighandler that works on sigaltstack. Without this flag, the subsequent signal will corrupt the state of the switched-away sighandler. The inspiration is more robust dosemu signal handling" * 'core-signals-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: signals/sigaltstack: Change SS_AUTODISARM to (1U << 31) signals/sigaltstack: Report current flag bits in sigaltstack() selftests/sigaltstack: Fix the sigaltstack test on old kernels signals/sigaltstack: If SS_AUTODISARM, bypass on_sig_stack() selftests/sigaltstack: Add new testcase for sigaltstack(SS_ONSTACK|SS_AUTODISARM) signals/sigaltstack: Implement SS_AUTODISARM flag signals/sigaltstack: Prepare to add new SS_xxx flags signals/sigaltstack, x86/signals: Unify the x86 sigaltstack check with other architectures
2016-05-16Merge branch 'core-lib-for-linus' of ↵Linus Torvalds1-1/+0
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core/lib update from Ingo Molnar: "This contains a single commit that removes an unused facility that the scheduler used to make use of" * 'core-lib-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: lib/proportions: Remove unused code
2016-05-12sched/core: Provide a tsk_nr_cpus_allowed() helperThomas Gleixner1-0/+5
tsk_nr_cpus_allowed() is an accessor for task->nr_cpus_allowed which allows us to change the representation of ->nr_cpus_allowed if required. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-05-12Merge branch 'smp/hotplug' into sched/core, to resolve conflictsIngo Molnar1-0/+9
Conflicts: kernel/sched/core.c Signed-off-by: Ingo Molnar <[email protected]>
2016-05-06sched/hotplug: Move migration CPU_DYING to sched_cpu_dying()Thomas Gleixner1-0/+6
Remove the hotplug notifier and make it an explicit state. Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-05-06sched/hotplug: Convert cpu_[in]active notifiers to state machineThomas Gleixner1-0/+2
Now that we reduced everything into single notifiers, it's simple to move them into the hotplug state machine space. Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: [email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-05-06sched: Make set_cpu_rq_start_time() a built in hotplug stateThomas Gleixner1-0/+1
Start disentangling the maze of hotplug notifiers in the scheduler. Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: [email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-05-05sched/fair: Add detailed description to the sched load avg metricsYuyang Du1-11/+49
These sched metrics have become complex enough, so describe them in detail at their definition. Signed-off-by: Yuyang Du <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> [ Fixed the text to improve its spelling and typography. ] Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-05-05sched/fair: Generalize the load/util averages resolution definitionYuyang Du1-3/+13
Integer metrics need fixed point arithmetic. In sched/fair, a few metrics, e.g., weight, load, load_avg, util_avg, freq, and capacity, may have different fixed point ranges, which makes their update and usage error-prone. In order to avoid the errors relating to the fixed point range, we define a basic fixed point range, and then formalize all metrics to base on the basic range. The basic range is 1024 or (1 << 10). Further, one can recursively apply the basic range to have larger range. Pointed out by Ben Segall, weight (visible to user, e.g., NICE-0 has 1024) and load (e.g., NICE_0_LOAD) have independent ranges, but they must be well calibrated. Signed-off-by: Yuyang Du <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
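To make the "basic range of 1024" concrete, here is a small fixed-point sketch in plain C. It assumes nothing about the kernel's actual macro names: `EX_FIXEDPOINT_SHIFT`, `EX_FIXEDPOINT_SCALE`, `fp_from_percent()` and `fp_mul()` are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Hypothetical names; 1024 (1 << 10) represents the value 1.0. */
#define EX_FIXEDPOINT_SHIFT 10
#define EX_FIXEDPOINT_SCALE ((int64_t)1 << EX_FIXEDPOINT_SHIFT)

/* Convert an integer percentage (0..100) into the basic fixed-point range. */
static inline int64_t fp_from_percent(int64_t percent)
{
	return percent * EX_FIXEDPOINT_SCALE / 100;
}

/*
 * Multiply two values that are both in the basic range. The raw product
 * carries the scale twice, so one shift brings it back to the basic range.
 */
static inline int64_t fp_mul(int64_t a, int64_t b)
{
	return (a * b) >> EX_FIXEDPOINT_SHIFT;
}
```

For instance, 50% maps to 512, and multiplying 512 by 512 yields 256, i.e. 25% in the same range. Keeping every metric on one shared scale is what lets such products be corrected with a single, known shift.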
2016-05-04signals/sigaltstack: If SS_AUTODISARM, bypass on_sig_stack()Andy Lutomirski1-0/+12
If a signal stack is set up with SS_AUTODISARM, then the kernel inherently avoids incorrectly resetting the signal stack if signals recurse: the signal stack will be reset on the first signal delivery. This means that we don't need to check the stack pointer when delivering signals if SS_AUTODISARM is set. This will make segmented x86 programs more robust: currently there's a hole that could be triggered if ESP/RSP appears to point to the signal stack but actually doesn't due to a nonzero SS base. Signed-off-by: Andy Lutomirski <[email protected]> Cc: Al Viro <[email protected]> Cc: Aleksa Sarai <[email protected]> Cc: Amanieu d'Antras <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Heinrich Schuchardt <[email protected]> Cc: Jason Low <[email protected]> Cc: Josh Triplett <[email protected]> Cc: Konstantin Khlebnikov <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Moore <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Sasha Levin <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Stas Sergeev <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/c46bee4654ca9e68c498462fd11746e2bd0d98c8.1462296606.git.luto@kernel.org Signed-off-by: Ingo Molnar <[email protected]>
2016-05-03signals/sigaltstack: Implement SS_AUTODISARM flagStas Sergeev1-0/+8
This patch implements the SS_AUTODISARM flag that can be OR-ed with SS_ONSTACK when forming ss_flags. When this flag is set, sigaltstack will be disabled when entering the signal handler; more precisely, after saving sas to uc_stack. When leaving the signal handler, the sigaltstack is restored by uc_stack. When this flag is used, it is safe to switch from sighandler with swapcontext(). Without this flag, the subsequent signal will corrupt the state of the switched-away sighandler. To detect the support of this functionality, one can do: err = sigaltstack(SS_DISABLE | SS_AUTODISARM); if (err && errno == EINVAL) unsupported(); Signed-off-by: Stas Sergeev <[email protected]> Cc: Al Viro <[email protected]> Cc: Aleksa Sarai <[email protected]> Cc: Amanieu d'Antras <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Heinrich Schuchardt <[email protected]> Cc: Jason Low <[email protected]> Cc: Josh Triplett <[email protected]> Cc: Konstantin Khlebnikov <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Moore <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Sasha Levin <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
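The detection snippet in the message is shorthand: the real sigaltstack() takes pointers to stack_t rather than a flags word. A compilable userspace version of the same probe might look like the sketch below. The helper name `ss_autodisarm_supported()` is hypothetical, and the fallback value for SS_AUTODISARM is taken from the later "Change SS_AUTODISARM to (1U << 31)" commit listed in this log.

```c
/* Ask the C library for the sigaltstack() declarations under strict -std modes. */
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <signal.h>
#include <string.h>

/* Fallback for headers that predate the flag. */
#ifndef SS_AUTODISARM
#define SS_AUTODISARM (1U << 31)
#endif

/*
 * Probe for SS_AUTODISARM support.
 * Returns 1 if the kernel accepts the flag, 0 if it rejects it with
 * EINVAL, and -1 on any other error (e.g. when called while running
 * on the alternate stack).
 * Side effect: on success, any previously installed alternate stack
 * is disabled, so call this before setting one up.
 */
static int ss_autodisarm_supported(void)
{
	stack_t ss;

	memset(&ss, 0, sizeof(ss));
	ss.ss_flags = SS_DISABLE | SS_AUTODISARM;

	if (sigaltstack(&ss, NULL) == 0)
		return 1;
	return errno == EINVAL ? 0 : -1;
}
```

A program would typically run the probe once at startup and only OR SS_AUTODISARM into ss_flags alongside SS_ONSTACK when it returns 1.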