aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2016-05-13ring-buffer: Use long for nr_pages to avoid overflow failuresSteven Rostedt (Red Hat)1-12/+14
The size variable to change the ring buffer in ftrace is a long. The nr_pages used to update the ring buffer based on the size is int. On 64 bit machines this can cause an overflow problem. For example, the following will cause the ring buffer to crash: # cd /sys/kernel/debug/tracing # echo 10 > buffer_size_kb # echo 8556384240 > buffer_size_kb Then you get the warning of: WARNING: CPU: 1 PID: 318 at kernel/trace/ring_buffer.c:1527 rb_update_pages+0x22f/0x260 Which is: RB_WARN_ON(cpu_buffer, nr_removed); Note each ring buffer page holds 4080 bytes. This is because: 1) 10 causes the ring buffer to have 3 pages. (10kb requires 3 * 4080 pages to hold) 2) (2^31 / 2^10 + 1) * 4080 = 8556384240 The value written into buffer_size_kb is shifted by 10 and then passed to ring_buffer_resize(). 8556384240 * 2^10 = 8761737461760 3) The size passed to ring_buffer_resize() is then divided by BUF_PAGE_SIZE which is 4080. 8761737461760 / 4080 = 2147484672 4) nr_pages is subtracted from the current nr_pages (3) and we get: 2147484669. This value is saved in a signed integer nr_pages_to_update 5) 2147484669 is greater than 2^31 but smaller than 2^32, a signed int turns into the value of -2147482627 6) As the value is a negative number, in update_pages_handler() it is negated and passed to rb_remove_pages() and 2147482627 pages will be removed, which is much larger than 3 and it causes the warning because not all the pages asked to be removed were removed. Link: https://bugzilla.kernel.org/show_bug.cgi?id=118001 Cc: [email protected] # 2.6.28+ Fixes: 7a8e76a3829f1 ("tracing: unified trace buffer") Reported-by: Hao Qin <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2016-05-13secomp: Constify mode1 syscall whitelistMatt Redfearn1-2/+2
These values are constant and should be marked as such. Signed-off-by: Matt Redfearn <[email protected]> Acked-by: Kees Cook <[email protected]> Cc: Will Drewry <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: [email protected] Cc: [email protected] Patchwork: https://patchwork.linux-mips.org/patch/12979/ Signed-off-by: Ralf Baechle <[email protected]>
2016-05-13seccomp: Get compat syscalls from asm-generic headerMatt Redfearn1-8/+1
Move retrieval of compat syscall numbers into inline function defined in asm-generic header so that arches may override it. [[email protected]: Resolve merge conflict.] Suggested-by: Paul Burton <[email protected]> Signed-off-by: Matt Redfearn <[email protected]> Acked-by: Kees Cook <[email protected]> Cc: [email protected] Cc: Arnd Bergmann <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Will Drewry <[email protected]> Cc: [email protected] Cc: [email protected] Patchwork: https://patchwork.linux-mips.org/patch/12978/ Signed-off-by: Ralf Baechle <[email protected]>
2016-05-12workqueue: fix rebind bound workers warningWanpeng Li1-0/+11
------------[ cut here ]------------ WARNING: CPU: 0 PID: 16 at kernel/workqueue.c:4559 rebind_workers+0x1c0/0x1d0 Modules linked in: CPU: 0 PID: 16 Comm: cpuhp/0 Not tainted 4.6.0-rc4+ #31 Hardware name: IBM IBM System x3550 M4 Server -[7914IUW]-/00Y8603, BIOS -[D7E128FUS-1.40]- 07/23/2013 0000000000000000 ffff881037babb58 ffffffff8139d885 0000000000000010 0000000000000000 0000000000000000 0000000000000000 ffff881037babba8 ffffffff8108505d ffff881037ba0000 000011cf3e7d6e60 0000000000000046 Call Trace: dump_stack+0x89/0xd4 __warn+0xfd/0x120 warn_slowpath_null+0x1d/0x20 rebind_workers+0x1c0/0x1d0 workqueue_cpu_up_callback+0xf5/0x1d0 notifier_call_chain+0x64/0x90 ? trace_hardirqs_on_caller+0xf2/0x220 ? notify_prepare+0x80/0x80 __raw_notifier_call_chain+0xe/0x10 __cpu_notify+0x35/0x50 notify_down_prepare+0x5e/0x80 ? notify_prepare+0x80/0x80 cpuhp_invoke_callback+0x73/0x330 ? __schedule+0x33e/0x8a0 cpuhp_down_callbacks+0x51/0xc0 cpuhp_thread_fun+0xc1/0xf0 smpboot_thread_fn+0x159/0x2a0 ? smpboot_create_threads+0x80/0x80 kthread+0xef/0x110 ? wait_for_completion+0xf0/0x120 ? schedule_tail+0x35/0xf0 ret_from_fork+0x22/0x50 ? __init_kthread_worker+0x70/0x70 ---[ end trace eb12ae47d2382d8f ]--- notify_down_prepare: attempt to take down CPU 0 failed This bug can be reproduced by below config w/ nohz_full= all cpus: CONFIG_BOOTPARAM_HOTPLUG_CPU0=y CONFIG_DEBUG_HOTPLUG_CPU0=y CONFIG_NO_HZ_FULL=y As Thomas pointed out: | If a down prepare callback fails, then DOWN_FAILED is invoked for all | callbacks which have successfully executed DOWN_PREPARE. | | But, workqueue has actually two notifiers. One which handles | UP/DOWN_FAILED/ONLINE and one which handles DOWN_PREPARE. | | Now look at the priorities of those callbacks: | | CPU_PRI_WORKQUEUE_UP = 5 | CPU_PRI_WORKQUEUE_DOWN = -5 | | So the call order on DOWN_PREPARE is: | | CB 1 | CB ... | CB workqueue_up() -> Ignores DOWN_PREPARE | CB ... | CB X ---> Fails | | So we call up to CB X with DOWN_FAILED | | CB 1 | CB ... | CB workqueue_up() -> Handles DOWN_FAILED | CB ... | CB X-1 | | So the problem is that the workqueue stuff handles DOWN_FAILED in the up | callback, while it should do it in the down callback. Which is not a good idea | either because it wants to be called early on rollback... | | Brilliant stuff, isn't it? The hotplug rework will solve this problem because | the callbacks become symetric, but for the existing mess, we need some | workaround in the workqueue code. The boot CPU handles housekeeping duty(unbound timers, workqueues, timekeeping, ...) on behalf of full dynticks CPUs. It must remain online when nohz full is enabled. There is a priority set to every notifier_blocks: workqueue_cpu_up > tick_nohz_cpu_down > workqueue_cpu_down So tick_nohz_cpu_down callback failed when down prepare cpu 0, and notifier_blocks behind tick_nohz_cpu_down will not be called any more, which leads to workers are actually not unbound. Then hotplug state machine will fallback to undo and online cpu 0 again. Workers will be rebound unconditionally even if they are not unbound and trigger the warning in this progress. This patch fix it by catching !DISASSOCIATED to avoid rebind bound workers. Cc: Tejun Heo <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Frédéric Weisbecker <[email protected]> Cc: [email protected] Suggested-by: Lai Jiangshan <[email protected]> Signed-off-by: Wanpeng Li <[email protected]>
2016-05-12cgroup: fix compile warningFelipe Balbi1-1/+1
commit 4f41fc59620f ("cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces") added the following compile warning: kernel/cgroup.c: In function ‘cgroup_show_path’: kernel/cgroup.c:1634:15: warning: unused variable ‘ret’ [-Wunused-variable] int len = 0, ret = 0; ^ fix it. Fixes: 4f41fc59620f ("cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces") Signed-off-by: Felipe Balbi <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2016-05-12perf/core: Disable the event on a truncated AUX recordAlexander Shishkin1-1/+9
When the PMU driver reports a truncated AUX record, it effectively means that there is no more usable room in the event's AUX buffer (even though there may still be some room, so that perf_aux_output_begin() doesn't take action). At this point the consumer still has to be woken up and the event has to be disabled, otherwise the event will just keep spinning between perf_aux_output_begin() and perf_aux_output_end() until its context gets unscheduled. Again, for cpu-wide events this means never, so once in this condition, they will be forever losing data. Fix this by disabling the event and waking up the consumer in case of a truncated AUX record. Reported-by: Markus Metzger <[email protected]> Signed-off-by: Alexander Shishkin <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vince Weaver <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/1462886313-13660-3-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <[email protected]>
2016-05-12perf/core: Disable the event on a truncated AUX recordAlexander Shishkin1-1/+9
When the PMU driver reports a truncated AUX record, it effectively means that there is no more usable room in the event's AUX buffer (even though there may still be some room, so that perf_aux_output_begin() doesn't take action). At this point the consumer still has to be woken up and the event has to be disabled, otherwise the event will just keep spinning between perf_aux_output_begin() and perf_aux_output_end() until its context gets unscheduled. Again, for cpu-wide events this means never, so once in this condition, they will be forever losing data. Fix this by disabling the event and waking up the consumer in case of a truncated AUX record. Reported-by: Markus Metzger <[email protected]> Signed-off-by: Alexander Shishkin <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vince Weaver <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/1462886313-13660-3-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <[email protected]>
2016-05-12sched/core: Provide a tsk_nr_cpus_allowed() helperThomas Gleixner3-27/+27
tsk_nr_cpus_allowed() is an accessor for task->nr_cpus_allowed which allows us to change the representation of ->nr_cpus_allowed if required. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-05-12sched/core: Use tsk_cpus_allowed() instead of accessing ->cpus_allowedThomas Gleixner3-5/+5
Use the future-safe accessor for struct task_struct's. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-05-12sched/loadavg: Fix loadavg artifacts on fully idle and on fully loaded systemsVik Heyndrickx1-4/+7
Systems show a minimal load average of 0.00, 0.01, 0.05 even when they have no load at all. Uptime and /proc/loadavg on all systems with kernels released during the last five years up until kernel version 4.6-rc5, show a 5- and 15-minute minimum loadavg of 0.01 and 0.05 respectively. This should be 0.00 on idle systems, but the way the kernel calculates this value prevents it from getting lower than the mentioned values. Likewise but not as obviously noticeable, a fully loaded system with no processes waiting, shows a maximum 1/5/15 loadavg of 1.00, 0.99, 0.95 (multiplied by number of cores). Once the (old) load becomes 93 or higher, it mathematically can never get lower than 93, even when the active (load) remains 0 forever. This results in the strange 0.00, 0.01, 0.05 uptime values on idle systems. Note: 93/2048 = 0.0454..., which rounds up to 0.05. It is not correct to add a 0.5 rounding (=1024/2048) here, since the result from this function is fed back into the next iteration again, so the result of that +0.5 rounding value then gets multiplied by (2048-2037), and then rounded again, so there is a virtual "ghost" load created, next to the old and active load terms. By changing the way the internally kept value is rounded, that internal value equivalent now can reach 0.00 on idle, and 1.00 on full load. Upon increasing load, the internally kept load value is rounded up, when the load is decreasing, the load value is rounded down. The modified code was tested on nohz=off and nohz kernels. It was tested on vanilla kernel 4.6-rc5 and on centos 7.1 kernel 3.10.0-327. It was tested on single, dual, and octal cores system. It was tested on virtual hosts and bare hardware. No unwanted effects have been observed, and the problems that the patch intended to fix were indeed gone. Tested-by: Damien Wyart <[email protected]> Signed-off-by: Vik Heyndrickx <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: <[email protected]> Cc: Doug Smythies <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Fixes: 0f004f5a696a ("sched: Cure more NO_HZ load average woes") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-05-12sched/fair: Correct unit of load_above_capacityMorten Rasmussen1-2/+4
In calculate_imbalance() load_above_capacity currently has the unit [capacity] while it is used as being [load/capacity]. Not only is it wrong it also makes it unlikely that load_above_capacity is ever used as the subsequent code picks the smaller of load_above_capacity and the avg_load This patch ensures that load_above_capacity has the right unit [load/capacity]. Signed-off-by: Morten Rasmussen <[email protected]> [ Changed changelog to note it was in capacity unit; +rebase. ] Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Dietmar Eggemann <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-05-12sched/fair: Clean up scale confusionPeter Zijlstra1-2/+1
Wanpeng noted that the scale_load_down() in calculate_imbalance() was weird. I agree, it should be SCHED_CAPACITY_SCALE, since we're going to compare against busiest->group_capacity, which is in [capacity] units. Reported-by: Wanpeng Li <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Morten Rasmussen <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Yuyang Du <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-05-12sched/nohz: Fix affine unpinned timers messWanpeng Li1-1/+4
The following commit: 9642d18eee2c ("nohz: Affine unpinned timers to housekeepers")' intended to affine unpinned timers to housekeepers: unpinned timers(full dynaticks, idle) => nearest busy housekeepers(otherwise, fallback to any housekeepers) unpinned timers(full dynaticks, busy) => nearest busy housekeepers(otherwise, fallback to any housekeepers) unpinned timers(houserkeepers, idle) => nearest busy housekeepers(otherwise, fallback to itself) However, the !idle_cpu(i) && is_housekeeping_cpu(cpu) check modified the intention to: unpinned timers(full dynaticks, idle) => any housekeepers(no mattter cpu topology) unpinned timers(full dynaticks, busy) => any housekeepers(no mattter cpu topology) unpinned timers(housekeepers, idle) => any busy cpus(otherwise, fallback to any housekeepers) This patch fixes it by checking if there are busy housekeepers nearby, otherwise falls to any housekeepers/itself. After the patch: unpinned timers(full dynaticks, idle) => nearest busy housekeepers(otherwise, fallback to any housekeepers) unpinned timers(full dynaticks, busy) => nearest busy housekeepers(otherwise, fallback to any housekeepers) unpinned timers(housekeepers, idle) => nearest busy housekeepers(otherwise, fallback to itself) Signed-off-by: Wanpeng Li <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> [ Fixed the changelog. ] Cc: Frederic Weisbecker <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Fixes: 'commit 9642d18eee2c ("nohz: Affine unpinned timers to housekeepers")' Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-05-12sched/fair: Fix fairness issue on migrationPeter Zijlstra1-6/+16
Pavan reported that in the presence of very light tasks (or cgroups) the placement of migrated tasks can cause severe fairness issues. The problem is that enqueue_entity() places the task before it updates time, thereby it can place the task far in the past (remember that light tasks will shoot virtual time forward at a high speed, so in relation to the pre-existing light task, we can land far in the past). This is done because update_curr() needs the current task, and we might be placing the current task. The obvious solution is to differentiate between the current and any other task; placing the current before we update time, and placing any other task after, such that !curr tasks end up at the current moment in time, and not in the past. This commit re-introduces the previously reverted commit: 3a47d5124a95 ("sched/fair: Fix fairness issue on migration") ... which is now safe to do, after we've also fixed another underlying bug first, in: sched/fair: Prepare to fix fairness problems on migration and cleaned up other details in the migration code: sched/core: Kill sched_class::task_waking Reported-by: Pavan Kondeti <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-05-12sched/core: Kill sched_class::task_waking to clean up the migration logicPeter Zijlstra3-42/+32
With sched_class::task_waking being called only when we do set_task_cpu(), we can make sched_class::migrate_task_rq() do the work and eliminate sched_class::task_waking entirely. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Hunter <[email protected]> Cc: Ben Segall <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Matt Fleming <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Morten Rasmussen <[email protected]> Cc: Paul Turner <[email protected]> Cc: Pavan Kondeti <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-05-12sched/fair: Prepare to fix fairness problems on migrationPeter Zijlstra2-8/+58
Mike reported that our recent attempt to fix migration problems: 3a47d5124a95 ("sched/fair: Fix fairness issue on migration") broke interactivity and the signal starve test. We reverted that commit and now let's try it again more carefully, with some other underlying problems fixed first. One problem is that I assumed ENQUEUE_WAKING was only set when we do a cross-cpu wakeup (migration), which isn't true. This means we now destroy the vruntime history of tasks and wakeup-preemption suffers. Cure this by making my assumption true, only call sched_class::task_waking() when we do a cross-cpu wakeup. This avoids the indirect call in the case we do a local wakeup. Reported-by: Mike Galbraith <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Hunter <[email protected]> Cc: Ben Segall <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Matt Fleming <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Morten Rasmussen <[email protected]> Cc: Paul Turner <[email protected]> Cc: Pavan Kondeti <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Fixes: 3a47d5124a95 ("sched/fair: Fix fairness issue on migration") Signed-off-by: Ingo Molnar <[email protected]>
2016-05-12sched/fair: Move record_wakee()Peter Zijlstra1-28/+33
Since I want to make ->task_woken() conditional on the task getting migrated, we cannot use it to call record_wakee(). Move it to select_task_rq_fair(), which gets called in almost all the same conditions. The only exception is if the woken task (@p) is CPU-bound (as per the nr_cpus_allowed test in select_task_rq()). Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Hunter <[email protected]> Cc: Ben Segall <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Matt Fleming <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Morten Rasmussen <[email protected]> Cc: Paul Turner <[email protected]> Cc: Pavan Kondeti <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-05-12Merge branch 'smp/hotplug' into sched/core, to resolve conflictsIngo Molnar4-255/+195
Conflicts: kernel/sched/core.c Signed-off-by: Ingo Molnar <[email protected]>
2016-05-12Merge branch 'sched/urgent' into sched/core to pick up fixesIngo Molnar15-82/+212
Signed-off-by: Ingo Molnar <[email protected]>
2016-05-11Merge branch 'perf/urgent' into perf/core, to pick up fixesIngo Molnar9-53/+105
Signed-off-by: Ingo Molnar <[email protected]>
2016-05-11genirq: Ensure IRQ descriptor is valid when setting-up the IRQJon Hunter1-1/+1
In the function, setup_irq(), we don't check that the descriptor returned from irq_to_desc() is valid before we start using it. For example chip_bus_lock() called from setup_irq(), assumes that the descriptor pointer is valid and doesn't check before dereferencing it. In many other functions including setup/free_percpu_irq() we do check that the descriptor returned is not NULL and therefore add the same test to setup_irq() to ensure the descriptor returned is valid. Signed-off-by: Jon Hunter <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2016-05-11Revert "sched/fair: Fix fairness issue on migration"Ingo Molnar1-14/+6
Mike reported that this recent commit: 3a47d5124a95 ("sched/fair: Fix fairness issue on migration") ... broke interactivity and the signal starvation test. We have a proper fix series in the works but ran out of time for v4.6, so revert the commit. Reported-by: Mike Galbraith <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-05-10Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds3-1/+10
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "A UP kernel cpufreq fix and a rt/dl scheduler corner case fix" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/rt, sched/dl: Don't push if task's scheduling class was changed sched/fair: Fix !CONFIG_SMP kernel cpufreq governor breakage
2016-05-10gcov: disable for COMPILE_TESTArnd Bergmann1-0/+1
Enabling gcov is counterproductive to compile testing: it significantly increases the kernel image size, compile time, and it produces lots of false positive "may be used uninitialized" warnings as the result of missed optimizations. This is in line with how UBSAN_SANITIZE_ALL and PROFILE_ALL_BRANCHES work, both of which have similar problems. With an ARM allmodconfig kernel, I see the build time drop from 283 minutes CPU time to 225 minutes, and the vmlinux size drops from 43MB to 26MB. Signed-off-by: Arnd Bergmann <[email protected]> Acked-by: Peter Oberparleiter <[email protected]> Signed-off-by: Michal Marek <[email protected]>
2016-05-10blktrace: add missed mask nameShaohua Li1-0/+1
BLK_TC_NOTIFY is missed in mask_maps, so we can't print out notify or set mask with 'notify' name. Signed-off-by: Shaohua Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2016-05-10blktrace: delete garbage for message traceShaohua Li1-0/+1
commit f4a1d08ce65 introduces a regression. Originally for BLK_TN_MESSAGE, we add message in trace and return. The commit ignores the early return and add garbage info. Signed-off-by: Shaohua Li <[email protected]> Reviewed-by: Jeff Moyer <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2016-05-10sched/rt, sched/dl: Don't push if task's scheduling class was changedXunlei Pang2-0/+2
We got this warning: WARNING: CPU: 1 PID: 2468 at kernel/sched/core.c:1161 set_task_cpu+0x1af/0x1c0 [...] Call Trace: dump_stack+0x63/0x87 __warn+0xd1/0xf0 warn_slowpath_null+0x1d/0x20 set_task_cpu+0x1af/0x1c0 push_dl_task.part.34+0xea/0x180 push_dl_tasks+0x17/0x30 __balance_callback+0x45/0x5c __sched_setscheduler+0x906/0xb90 SyS_sched_setattr+0x150/0x190 do_syscall_64+0x62/0x110 entry_SYSCALL64_slow_path+0x25/0x25 This corresponds to: WARN_ON_ONCE(p->state == TASK_RUNNING && p->sched_class == &fair_sched_class && (p->on_rq && !task_on_rq_migrating(p))) It happens because in find_lock_later_rq(), the task whose scheduling class was changed to fair class is still pushed away as if it were a deadline task ... So, check in find_lock_later_rq() after double_lock_balance(), if the scheduling class of the deadline task was changed, break and retry. Apply the same logic to RT tasks. Signed-off-by: Xunlei Pang <[email protected]> Reviewed-by: Steven Rostedt <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Juri Lelli <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-05-09perf/core: Change the default paranoia level to 2Andy Lutomirski1-1/+1
Allowing unprivileged kernel profiling lets any user dump follow kernel control flow and dump kernel registers. This most likely allows trivial kASLR bypassing, and it may allow other mischief as well. (Off the top of my head, the PERF_SAMPLE_REGS_INTR output during /dev/urandom reads could be quite interesting.) Signed-off-by: Andy Lutomirski <[email protected]> Acked-by: Kees Cook <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-05-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2-15/+23
In netdevice.h we removed the structure in net-next that is being changes in 'net'. In macsec.c and rtnetlink.c we have overlaps between fixes in 'net' and the u64 attribute changes in 'net-next'. The mlx5 conflicts have to do with vxlan support dependencies. Signed-off-by: David S. Miller <[email protected]>
2016-05-09cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespacesSerge E. Hallyn1-0/+63
Patch summary: When showing a cgroupfs entry in mountinfo, show the path of the mount root dentry relative to the reader's cgroup namespace root. Short explanation (courtesy of mkerrisk): If we create a new cgroup namespace, then we want both /proc/self/cgroup and /proc/self/mountinfo to show cgroup paths that are correctly virtualized with respect to the cgroup mount point. Previous to this patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo does not. Long version: When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup namespace, and then mounts a new instance of the freezer cgroup, the new mount will be rooted at /a/b. The root dentry field of the mountinfo entry will show '/a/b'. cat > /tmp/do1 << EOF mount -t cgroup -o freezer freezer /mnt grep freezer /proc/self/mountinfo EOF unshare -Gm bash /tmp/do1 > 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer > 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer The task's freezer cgroup entry in /proc/self/cgroup will simply show '/': grep freezer /proc/self/cgroup 9:freezer:/ If instead the same task simply bind mounts the /a/b cgroup directory, the resulting mountinfo entry will again show /a/b for the dentry root. However in this case the task will find its own cgroup at /mnt/a/b, not at /mnt: mount --bind /sys/fs/cgroup/freezer/a/b /mnt 130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer In other words, there is no way for the task to know, based on what is in mountinfo, which cgroup directory is its own. Example (by mkerrisk): First, a little script to save some typing and verbiage: echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)" cat /proc/self/mountinfo | grep freezer | awk '{print "\tmountinfo:\t\t" $4 "\t" $5}' Create cgroup, place this shell into the cgroup, and look at the state of the /proc files: 2653 2653 # Our shell 14254 # cat(1) /proc/self/cgroup: 10:freezer:/a/b mountinfo: / /sys/fs/cgroup/freezer Create a shell in new cgroup and mount namespaces. The act of creating a new cgroup namespace causes the process's current cgroups directories to become its cgroup root directories. (Here, I'm using my own version of the "unshare" utility, which takes the same options as the util-linux version): Look at the state of the /proc files: /proc/self/cgroup: 10:freezer:/ mountinfo: / /sys/fs/cgroup/freezer The third entry in /proc/self/cgroup (the pathname of the cgroup inside the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which is rooted at /a/b in the outer namespace. However, the info in /proc/self/mountinfo is not for this cgroup namespace, since we are seeing a duplicate of the mount from the old mount namespace, and the info there does not correspond to the new cgroup namespace. However, trying to create a new mount still doesn't show us the right information in mountinfo: # propagating to other mountns /proc/self/cgroup: 7:freezer:/ mountinfo: /a/b /mnt/freezer The act of creating a new cgroup namespace caused the process's current freezer directory, "/a/b", to become its cgroup freezer root directory. In other words, the pathname directory of the directory within the newly mounted cgroup filesystem should be "/", but mountinfo wrongly shows us "/a/b". The consequence of this is that the process in the cgroup namespace cannot correctly construct the pathname of its cgroup root directory from the information in /proc/PID/mountinfo. With this patch, the dentry root field in mountinfo is shown relative to the reader's cgroup namespace. So the same steps as above: /proc/self/cgroup: 10:freezer:/a/b mountinfo: / /sys/fs/cgroup/freezer /proc/self/cgroup: 10:freezer:/ mountinfo: /../.. /sys/fs/cgroup/freezer /proc/self/cgroup: 10:freezer:/ mountinfo: / /mnt/freezer cgroup.clone_children freezer.parent_freezing freezer.state tasks cgroup.procs freezer.self_freezing notify_on_release 3164 2653 # First shell that placed in this cgroup 3164 # Shell started by 'unshare' 14197 # cat(1) Signed-off-by: Serge Hallyn <[email protected]> Tested-by: Michael Kerrisk <[email protected]> Acked-by: Michael Kerrisk <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2016-05-09sched/core: Fix comment typo in wake_q_add()Davidlohr Bueso1-1/+1
... the comment clearly refers to wake_up_q(), and not wake_up_list(). Signed-off-by: Davidlohr Bueso <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-05-09sched/core: Remove unused variableMuhammad Falak R Wani1-2/+2
Remove unused variable 'ret', and directly return 0. Signed-off-by: Muhammad Falak R Wani <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-05-07sched/fair: Fix !CONFIG_SMP kernel cpufreq governor breakageRafael J. Wysocki1-1/+8
The following commit: 34e2c555f3e1 ("cpufreq: Add mechanism for registering utilization update callbacks") overlooked the fact that update_load_avg(), where CFS invokes cpufreq utilization update callbacks, becomes an empty stub on UP kernels. In consequence, if !CONFIG_SMP, cpufreq governors are never invoked from CFS and they do not have a chance to evaluate CPU performace levels and update them often enough. Needless to say, things don't work as expected then. Fix the problem by making the !CONFIG_SMP stub of update_load_avg() invoke cpufreq update callbacks too. Reported-by: Steve Muckle <[email protected]> Tested-by: Steve Muckle <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]> Acked-by: Steve Muckle <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Linux PM list <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Srinivas Pandruvada <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Viresh Kumar <[email protected]> Fixes: 34e2c555f3e1 (cpufreq: Add mechanism for registering utilization update callbacks) Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-05-06bpf: improve verifier state equivalenceAlexei Starovoitov1-20/+3
since UNKNOWN_VALUE type is weaker than CONST_IMM we can un-teach verifier its recognition of constants in conditional branches without affecting safety. Ex: if (reg == 123) { .. here verifier was marking reg->type as CONST_IMM instead keep reg as UNKNOWN_VALUE } Two verifier states with UNKNOWN_VALUE are equivalent, whereas CONST_IMM_X != CONST_IMM_Y, since CONST_IMM is used for stack range verification and other cases. So help search pruning by marking registers as UNKNOWN_VALUE where possible instead of CONST_IMM. Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Daniel Borkmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-05-06bpf: direct packet accessAlexei Starovoitov2-8/+438
Extended BPF carried over two instructions from classic to access packet data: LD_ABS and LD_IND. They're highly optimized in JITs, but due to their design they have to do length check for every access. When BPF is processing 20M packets per second single LD_ABS after JIT is consuming 3% cpu. Hence the need to optimize it further by amortizing the cost of 'off < skb_headlen' over multiple packet accesses. One option is to introduce two new eBPF instructions LD_ABS_DW and LD_IND_DW with similar usage as skb_header_pointer(). The kernel part for interpreter and x64 JIT was implemented in [1], but such new insns behave like old ld_abs and abort the program with 'return 0' if access is beyond linear data. Such hidden control flow is hard to workaround plus changing JITs and rolling out new llvm is incovenient. Therefore allow cls_bpf/act_bpf program access skb->data directly: int bpf_prog(struct __sk_buff *skb) { struct iphdr *ip; if (skb->data + sizeof(struct iphdr) + ETH_HLEN > skb->data_end) /* packet too small */ return 0; ip = skb->data + ETH_HLEN; /* access IP header fields with direct loads */ if (ip->version != 4 || ip->saddr == 0x7f000001) return 1; [...] } This solution avoids introduction of new instructions. llvm stays the same and all JITs stay the same, but verifier has to work extra hard to prove safety of the above program. For XDP the direct store instructions can be allowed as well. The skb->data is NET_IP_ALIGNED, so for common cases the verifier can check the alignment. The complex packet parsers where packet pointer is adjusted incrementally cannot be tracked for alignment, so allow byte access in such cases and misaligned access on architectures that define efficient_unaligned_access [1] https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/?h=ld_abs_dw Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Daniel Borkmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-05-06bpf: cleanup verifier codeAlexei Starovoitov1-47/+53
cleanup verifier code and prepare it for addition of "pointer to packet" logic Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Daniel Borkmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-05-06Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds1-13/+16
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "This contains a single fix that fixes a nohz tick stopping bug when mixed-poliocy SCHED_FIFO and SCHED_RR tasks are present on a runqueue" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: nohz/full, sched/rt: Fix missed tick-reenabling bug in sched_can_stop_tick()
2016-05-06sched: Make hrtick_notifier an explicit callThomas Gleixner1-33/+1
No need for an extra notifier. We don't need to handle all these states. It's sufficient to kill the timer when the cpu dies. Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-05-06sched/fair: Make ilb_notifier an explicit callThomas Gleixner3-14/+6
No need for an extra notifier. Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-05-06sched/hotplug: Make activate() the last hotplug stepThomas Gleixner1-6/+9
The scheduler can handle per cpu threads before the cpu is set to active and it does not allow user space threads on the cpu before active is set. Attaching to the scheduling domains is also not required before user space threads can be handled. Move the activation to the end of the hotplug state space. That also means that deactivation is the first action when a cpu is shut down. Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-05-06sched/hotplug: Move migration CPU_DYING to sched_cpu_dying()Thomas Gleixner2-51/+23
Remove the hotplug notifier and make it an explicit state. Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-05-06sched/migration: Move CPU_ONLINE into scheduler stateThomas Gleixner1-11/+22
The alleged requirement that the migration notifier has a lower priority than perf is completely undocumented and there is no indication at all that this is true. perf does not even handle the CPU_ONLINE notification and perf really has nothing to do with migration. Move the CPU_ONLINE code into the sched_activate_cpu() state callback. Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-05-06sched/migration: Move calc_load_migrate() into CPU_DYINGThomas Gleixner1-3/+0
It really does not matter when we fold the load for the outgoing cpu. It's almost dead anyway, so there is no harm if we fail to fold the few microseconds which are required for going fully away. Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-05-06sched/migration: Move prepare transition to SCHED_STARTING stateThomas Gleixner1-9/+11
We can piggy pack that on the SCHED_STARTING state. It's not required before the cpu actually comes online. Name the function proper as it has nothing to do with migration. Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-05-06sched/hotplug: Move sync_rcu to be with set_cpu_active(false)Peter Zijlstra2-15/+14
The sync_rcu stuff is specificically for clearing bits in the active mask, such that everybody will observe the bit cleared and will not consider the cleared CPU for load-balancing etc. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-05-06sched/hotplug: Convert cpu_[in]active notifiers to state machineThomas Gleixner2-48/+27
Now that we reduced everything into single notifiers, it's simple to move them into the hotplug state machine space. Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: [email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-05-06sched: Move sched_domains_numa_masks_clear() to DOWN_PREPAREThomas Gleixner1-3/+0
This is the last operation on the cpu before vanishing. No point in calling that on CPU_DEAD. Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: [email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-05-06sched: Consolidate the notifier mazeThomas Gleixner1-105/+69
We can maintain the ordering of the scheduler cpu hotplug functionality nicely in one notifer. Get rid of the maze. Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: [email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-05-06sched: Allow hotplug notifiers to be setup earlyThomas Gleixner1-23/+36
Prevent the SMP scheduler related notifiers to be executed before the smp scheduler is initialized and install them early. This is a preparatory change for further consolidation of the hotplug notifier maze. Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: [email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-05-06sched: Make set_cpu_rq_start_time() a built in hotplug stateThomas Gleixner2-7/+15
Start distangling the maze of hotplug notifiers in the scheduler. Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: [email protected] Signed-off-by: Thomas Gleixner <[email protected]>