path: root/kernel
Age  Commit message  Author  Files  Lines
2016-09-21  bpf: recognize 64bit immediate loads as consts  (Jakub Kicinski)  [1 file, -2/+12]
When running as a parser, interpret BPF_LD | BPF_IMM | BPF_DW instructions as loading CONST_IMM with the value stored in imm. The verifier will continue not recognizing those due to concerns about search space/program complexity increase. Signed-off-by: Jakub Kicinski <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Acked-by: Daniel Borkmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
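For reference, BPF_LD | BPF_IMM | BPF_DW is the only two-slot eBPF instruction. A minimal sketch of such a load using the instruction-building macros from include/linux/filter.h (the immediate value and register choice are illustrative only):

    struct bpf_insn insns[] = {
        /* Two-slot 64-bit immediate load; a parser built on the verifier
         * core can now track r1 as CONST_IMM with this value. */
        BPF_LD_IMM64(BPF_REG_1, 0x1234567890abcdefULL),
        BPF_MOV64_IMM(BPF_REG_0, 0),
        BPF_EXIT_INSN(),
    };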
2016-09-21  bpf: enable non-core use of the verifier  (Jakub Kicinski)  [1 file, -0/+68]
Advanced JIT compilers and translators may want to use the eBPF verifier as a base for parsers or to perform custom checks and validations. Add the ability for external users to invoke the verifier and provide callbacks to be invoked for every instruction checked. For now only the most basic callback, for per-instruction pre-interpretation checks, is added. More advanced users may also like to have a per-instruction post callback and a state comparison callback. Signed-off-by: Jakub Kicinski <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
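A rough sketch of how an external analyzer might hook in, based on the interface described above; the exact symbol names (bpf_analyzer(), struct bpf_ext_analyzer_ops, insn_hook) are assumptions here and should be checked against the actual header:

    /* Assumed interface names; treat as illustrative only. */
    static int my_insn_hook(struct bpf_verifier_env *env,
                            int insn_idx, int prev_insn_idx)
    {
        /* per-instruction pre-interpretation checks for a JIT/translator */
        return 0;    /* a non-zero return would abort the walk */
    }

    static const struct bpf_ext_analyzer_ops my_ops = {
        .insn_hook = my_insn_hook,
    };

    /* ... then run the verifier core over a given struct bpf_prog *prog:
     * err = bpf_analyzer(prog, &my_ops, priv);
     */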
2016-09-21  bpf: expose internal verifier structures  (Jakub Kicinski)  [1 file, -163/+103]
Move verifier's internal structures to a header file and prefix their names with bpf_ to avoid potential namespace conflicts. Those structures will soon be used by external analyzers. Signed-off-by: Jakub Kicinski <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Acked-by: Daniel Borkmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-09-21  bpf: don't (ab)use instructions to store state  (Jakub Kicinski)  [1 file, -30/+40]
Storing state in reserved fields of instructions makes it impossible to run verifier on programs already marked as read-only. Allocate and use an array of per-instruction state instead. While touching the error path rename and move existing jump target. Suggested-by: Alexei Starovoitov <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Acked-by: Daniel Borkmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-09-20  bpf: direct packet write and access for helpers for clsact progs  (Daniel Borkmann)  [2 files, -14/+43]
This work implements direct packet access for helpers and direct packet write in a similar fashion as already available for XDP types via commits 4acf6c0b84c9 ("bpf: enable direct packet data write for xdp progs") and 6841de8b0d03 ("bpf: allow helpers access the packet directly"), and as a complementary feature to the already available direct packet read for tc (cls/act) programs. For enabling this, we need to introduce two helpers, bpf_skb_pull_data() and bpf_csum_update(). The first is generally needed for both, read and write, because they would otherwise only be limited to the current linear skb head. Usually, when the data_end test fails, programs just bail out, or, in the direct read case, use bpf_skb_load_bytes() as an alternative to overcome this limitation. If such data sits in non-linear parts, we can just pull them in once with the new helper, retest and eventually access them. At the same time, this also makes sure the skb is uncloned, which is, of course, a necessary condition for direct write. As this needs to be an invariant for the write part only, the verifier detects writes and adds a prologue that is calling bpf_skb_pull_data() to effectively unclone the skb from the very beginning in case it is indeed cloned. The heuristic makes use of a similar trick that was done in 233577a22089 ("net: filter: constify detection of pkt_type_offset"). This comes at zero cost for other programs that do not use the direct write feature. Should a program use this feature only sparsely and has read access for the most parts with, for example, drop return codes, then such write action can be delegated to a tail called program for mitigating this cost of potential uncloning to a late point in time where it would have been paid similarly with the bpf_skb_store_bytes() as well. Advantage of direct write is that the writes are inlined whereas the helper cannot make any length assumptions and thus needs to generate a call to memcpy() also for small sizes, as well as cost of helper call itself with sanity checks are avoided. Plus, when direct read is already used, we don't need to cache or perform rechecks on the data boundaries (due to verifier invalidating previous checks for helpers that change skb->data), so more complex programs using rewrites can benefit from switching to direct read plus write. For direct packet access to helpers, we save the otherwise needed copy into a temp struct sitting on stack memory when use-case allows. Both facilities are enabled via may_access_direct_pkt_data() in verifier. For now, we limit this to map helpers and csum_diff, and can successively enable other helpers where we find it makes sense. Helpers that definitely cannot be allowed for this are those part of bpf_helper_changes_skb_data() since they can change underlying data, and those that write into memory as this could happen for packet typed args when still cloned. bpf_csum_update() helper accommodates for the fact that we need to fixup checksum_complete when using direct write instead of bpf_skb_store_bytes(), meaning the programs can use available helpers like bpf_csum_diff(), and implement csum_add(), csum_sub(), csum_block_add(), csum_block_sub() equivalents in eBPF together with the new helper. A usage example will be provided for iproute2's examples/bpf/ directory. Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
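A hedged sketch of the pull/retest/write pattern described above, written as a tc (cls_bpf) program in restricted C. The 42-byte bound, the section name and the helper declarations (assumed to come from the usual iproute2 bpf helper header) are illustrative only, not taken from the patch:

    __section("classifier")
    int pull_and_write(struct __sk_buff *skb)
    {
        void *data = (void *)(long)skb->data;
        void *data_end = (void *)(long)skb->data_end;

        if (data + 42 > data_end) {
            /* Needed bytes may sit in non-linear parts: pull them in,
             * which also makes sure the skb is uncloned ... */
            if (bpf_skb_pull_data(skb, 42))
                return TC_ACT_OK;
            /* ... and retest, since the old bounds are invalidated. */
            data = (void *)(long)skb->data;
            data_end = (void *)(long)skb->data_end;
            if (data + 42 > data_end)
                return TC_ACT_OK;
        }

        ((__u8 *)data)[41] ^= 0x01;    /* direct packet write */
        return TC_ACT_OK;
    }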
2016-09-20  bpf, verifier: enforce larger zero range for pkt on overloading stack buffs  (Daniel Borkmann)  [1 file, -1/+1]
Current contract for the following two helper argument types is: * ARG_CONST_STACK_SIZE: passed argument pair must be (ptr, >0). * ARG_CONST_STACK_SIZE_OR_ZERO: passed argument pair can be either (NULL, 0) or (ptr, >0). With 6841de8b0d03 ("bpf: allow helpers access the packet directly"), we can pass also raw packet data to helpers, so depending on the argument type being PTR_TO_PACKET, we now either assert memory via check_packet_access() or check_stack_boundary(). As a result, the tests in check_packet_access() currently allow more than intended with regards to reg->imm. Back in 969bf05eb3ce ("bpf: direct packet access"), check_packet_access() was fine to ignore size argument since in check_mem_access() size was bpf_size_to_bytes() derived and prior to the call to check_packet_access() guaranteed to be larger than zero. However, for the above two argument types, it currently means, we can have a <= 0 size and thus breaking current guarantees for helpers. Enforce a check for size <= 0 and bail out if so. check_stack_boundary() doesn't have such an issue since it already tests for access_size <= 0 and bails out, resp. access_size == 0 in case of NULL pointer passed when allowed. Fixes: 6841de8b0d03 ("bpf: allow helpers access the packet directly") Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-09-20  Merge branch 'irq/urgent' into irq/core  (Thomas Gleixner)  [4 files, -16/+76]
Merge urgent fixes so pending patches for 4.9 can be applied.
2016-09-20  Merge branch 'linus' into x86/asm, to pick up fixes  (Ingo Molnar)  [1 file, -0/+6]
Signed-off-by: Ingo Molnar <[email protected]>
2016-09-19  cgroup: duplicate cgroup reference when cloning sockets  (Johannes Weiner)  [1 file, -0/+6]
When a socket is cloned, the associated sock_cgroup_data is duplicated but not its reference on the cgroup. As a result, the cgroup reference count will underflow when both sockets are destroyed later on. Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Tejun Heo <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: <[email protected]> [4.5+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-09-19  padata: Convert to hotplug state machine  (Sebastian Andrzej Siewior)  [1 file, -38/+50]
Install the callbacks via the state machine. CPU-hotplug multi-instance support is used with the nocalls() version. Maybe parts of padata_alloc() could be moved into the online callback so that we could invoke the ->startup callback per instance and drop get_online_cpus(). Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Cc: Steffen Klassert <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-09-19  genirq: Skip chained interrupt trigger setup if type is IRQ_TYPE_NONE  (Marc Zyngier)  [1 file, -2/+6]
There is no point in trying to configure the trigger of a chained interrupt if no trigger information has been configured. At best this is ignored, and at worst this confuses the underlying irqchip (which is likely not to handle such a thing) and unnecessarily alarms the user. Only apply the configuration if type is not IRQ_TYPE_NONE. Fixes: 1e12c4a9393b ("genirq: Correctly configure the trigger on chained interrupts") Reported-and-tested-by: Geert Uytterhoeven <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lkml.kernel.org/r/CAMuHMdVW1eTn20=EtYcJ8hkVwohaSuH_yQXrY2MGBEvZ8fpFOg@mail.gmail.com Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-09-17  workqueue: remove keventd_up()  (Tejun Heo)  [1 file, -1/+1]
keventd_up() no longer has in-kernel users. Remove it and make wq_online static. Signed-off-by: Tejun Heo <[email protected]>
2016-09-17  power, workqueue: remove keventd_up() usage  (Tejun Heo)  [1 file, -10/+1]
Now that workqueue can handle work item queueing/cancelling from very early during boot, there is no need to gate cancel_delayed_work_sync() while !keventd_up(). Remove it. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Rafael J. Wysocki <[email protected]> Cc: Qiao Zhou <[email protected]>
2016-09-17  workqueue: make workqueue available early during boot  (Tejun Heo)  [1 file, -16/+60]
Workqueue is currently initialized in an early init call; however, there are cases where early boot code has to be split and reordered to come after workqueue initialization, or the same code path which makes use of workqueues is used both before workqueue initialization and after. The latter cases have to gate workqueue usages with keventd_up() tests, which is nasty and easy to get wrong. Workqueue usage has become widespread and it'd be a lot more convenient if it could be used very early from boot. This patch splits workqueue initialization into two steps: workqueue_init_early(), which sets up the basic data structures so that workqueues can be created and work items queued, and workqueue_init(), which actually brings up workqueues online and starts executing queued work items. The former step can be done very early during boot once memory allocation, cpumasks and idr are initialized. The latter can be done right after kthreads become available. This allows work item queueing and canceling from very early boot, which is what most of these use cases want.
* As system_wq being initialized doesn't indicate that workqueue is fully online anymore, update keventd_up() to test wq_online instead. The follow-up patches will get rid of all its usages and the function itself.
* Flushing doesn't make sense before workqueue is fully initialized. The flush functions trigger WARN and return immediately before fully online.
* Work items are never in-flight before fully online. Canceling can always succeed by skipping the flush step.
* Some code paths can no longer assume to be called with irq enabled as irq is disabled during early boot. Use irqsave/restore operations instead.
v2: Watchdog init, which requires timer to be running, moved from workqueue_init_early() to workqueue_init(). Signed-off-by: Tejun Heo <[email protected]> Suggested-by: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/CA+55aFx0vPuMuxn00rBSM192n-Du5uxy+4AvKa0SBSOVJeuCGg@mail.gmail.com
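A small sketch of what the two-step init buys early boot code (an illustrative driver-style snippet, not from the patch): work can be queued before workqueue_init() runs, it just won't execute until the pools come online, and flushing before that point would WARN:

    #include <linux/init.h>
    #include <linux/printk.h>
    #include <linux/workqueue.h>

    static void deferred_fn(struct work_struct *work)
    {
        pr_info("ran once workqueue_init() brought the pools online\n");
    }
    static DECLARE_WORK(deferred_work, deferred_fn);

    static void __init early_setup(void)
    {
        /* Fine even before workqueue_init(): the item is only queued. */
        schedule_work(&deferred_work);

        /* flush_work(&deferred_work) here would WARN and return
         * immediately, since workqueues are not fully online yet. */
    }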
2016-09-16  cpuset: fix non static symbol warning  (Wei Yongjun)  [1 file, -1/+1]
Fixes the following sparse warning: kernel/cpuset.c:2088:6: warning: symbol 'cpuset_fork' was not declared. Should it be static? Signed-off-by: Wei Yongjun <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2016-09-16  workqueue: dump workqueue state on sanity check failures in destroy_workqueue()  (Tejun Heo)  [1 file, -0/+2]
destroy_workqueue() performs a number of sanity checks to ensure that the workqueue is empty before proceeding with destruction. However, it's not always easy to tell what's going on just from the warning message. Let's dump workqueue state after sanity check failures to help debugging. Signed-off-by: Tejun Heo <[email protected]> Link: http://lkml.kernel.org/r/CACT4Y+Zs6vkjHo9qHb4TrEiz3S4+quvvVQ9VWvj2Mx6pETGb9Q@mail.gmail.com Cc: Dmitry Vyukov <[email protected]>
2016-09-16  fork: Optimize task creation by caching two thread stacks per CPU if CONFIG_VMAP_STACK=y  (Andy Lutomirski)  [1 file, -9/+53]
vmalloc() is a bit slow, and pounding vmalloc()/vfree() will eventually force a global TLB flush. To reduce pressure on them, if CONFIG_VMAP_STACK=y, cache two thread stacks per CPU. This will let us quickly allocate a hopefully cache-hot, TLB-hot stack under heavy forking workloads (shell script style). On my silly pthread_create() benchmark, it saves about 2 µs per pthread_create()+join() with CONFIG_VMAP_STACK=y. Signed-off-by: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Jann Horn <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/94811d8e3994b2e962f88866290017d498eb069c.1474003868.git.luto@kernel.org Signed-off-by: Ingo Molnar <[email protected]>
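A simplified sketch of the per-CPU cache idea this entry describes (not the exact kernel/fork.c change; the two-entry size matches the changelog, the helper name is made up):

    #include <linux/percpu.h>
    #include <linux/vmalloc.h>

    #define NR_CACHED_STACKS 2
    static DEFINE_PER_CPU(struct vm_struct *, cached_stacks[NR_CACHED_STACKS]);

    static void *try_get_cached_stack(void)
    {
        int i;

        for (i = 0; i < NR_CACHED_STACKS; i++) {
            /* Atomically claim a previously freed, likely TLB-hot stack. */
            struct vm_struct *s = this_cpu_xchg(cached_stacks[i], NULL);

            if (s)
                return s->addr;
        }
        return NULL;    /* fall back to a fresh vmalloc()-based allocation */
    }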
2016-09-16  sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK  (Andy Lutomirski)  [2 files, -1/+38]
We currently keep every task's stack around until the task_struct itself is freed. This means that we keep the stack allocation alive for longer than necessary and that, under load, we free stacks in big batches whenever RCU drops the last task reference. Neither of these is good for reuse of cache-hot memory, and freeing in batches prevents us from usefully caching small numbers of vmalloced stacks. On architectures that have thread_info on the stack, we can't easily change this, but on architectures that set THREAD_INFO_IN_TASK, we can free it as soon as the task is dead. Signed-off-by: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Jann Horn <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/08ca06cde00ebed0046c5d26cbbf3fbb7ef5b812.1474003868.git.luto@kernel.org Signed-off-by: Ingo Molnar <[email protected]>
2016-09-16  kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function  (Oleg Nesterov)  [1 file, -2/+6]
get_task_struct(tsk) no longer pins tsk->stack so all users of to_live_kthread() should do try_get_task_stack/put_task_stack to protect "struct kthread" which lives on kthread's stack. TODO: Kill to_live_kthread(), perhaps we can even kill "struct kthread" too, and rework kthread_stop(), it can use task_work_add() to sync with the exiting kernel thread. Message-Id: <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Signed-off-by: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Jann Horn <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/cb9b16bbc19d4aea4507ab0552e4644c1211d130.1474003868.git.luto@kernel.org Signed-off-by: Ingo Molnar <[email protected]>
2016-09-16  Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu  (Ingo Molnar)  [11 files, -120/+164]
Pull RCU changes from Paul E. McKenney:
- Expedited grace-period changes, most notably avoiding having user threads drive expedited grace periods, using a workqueue instead.
- Miscellaneous fixes, including a performance fix for lists that was sent with the lists modifications (second URL below).
- CPU hotplug updates, most notably providing exact CPU-online tracking for RCU. This will in turn allow removal of the checks supporting RCU's prior heuristic that was based on the assumption that CPUs would take no longer than one jiffy to come online.
- Torture-test updates.
- Documentation updates.
Signed-off-by: Ingo Molnar <[email protected]>
2016-09-15  Merge branch 'irq/for-block' into irq/core  (Thomas Gleixner)  [3 files, -56/+168]
Add the new irq spreading infrastructure.
2016-09-15  sched/core: Allow putting thread_info into task_struct  (Andy Lutomirski)  [1 file, -0/+4]
If an arch opts in by setting CONFIG_THREAD_INFO_IN_TASK_STRUCT, then thread_info is defined as a single 'u32 flags' and is the first entry of task_struct. thread_info::task is removed (it serves no purpose if thread_info is embedded in task_struct), and thread_info::cpu gets its own slot in task_struct. This is heavily based on a patch written by Linus. Originally-from: Linus Torvalds <[email protected]> Signed-off-by: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Jann Horn <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/a0898196f0476195ca02713691a5037a14f2aac5.1473801993.git.luto@kernel.org Signed-off-by: Ingo Molnar <[email protected]>
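A sketch of the opt-in layout described above (simplified; the flags type follows the changelog, the config symbol name follows the other entries in this series, and the surrounding fields are elided):

    struct thread_info {
        u32 flags;                          /* low-level thread flags */
    };

    struct task_struct {
    #ifdef CONFIG_THREAD_INFO_IN_TASK
        struct thread_info thread_info;     /* must remain the first entry */
        unsigned int cpu;                   /* moved out of thread_info */
    #endif
        /* ... remaining task_struct fields ... */
    };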
2016-09-15  Merge branch 'linus' into x86/asm, to pick up recent fixes  (Ingo Molnar)  [22 files, -63/+260]
Signed-off-by: Ingo Molnar <[email protected]>
2016-09-14  genirq/affinity: Remove old irq spread infrastructure  (Thomas Gleixner)  [1 file, -58/+0]
No more users. Signed-off-by: Thomas Gleixner <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-09-14  genirq/msi: Switch to new irq spreading infrastructure  (Thomas Gleixner)  [1 file, -16/+15]
Switch MSI over to the new spreading code. If a pci device contains a valid pointer to a cpumask, then this mask is used for spreading otherwise the online cpu mask is used. This allows a driver to restrict the spread to a subset of CPUs, e.g. cpus on a particular node. Signed-off-by: Thomas Gleixner <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-09-14  genirq/affinity: Provide smarter irq spreading infrastructure  (Thomas Gleixner)  [1 file, -0/+149]
The current irq spreading infrastructure is just looking at a cpumask and tries to spread the interrupts over the mask. Thats suboptimal as it does not take numa nodes into account. Change the logic so the interrupts are spread across numa nodes and inside the nodes. If there are more cpus than vectors per node, then we set the affinity to several cpus. If HT siblings are available we take that into account and try to set all siblings to a single vector. Signed-off-by: Thomas Gleixner <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected]
2016-09-14  genirq/msi: Add cpumask allocation to alloc_msi_entry  (Thomas Gleixner)  [1 file, -2/+24]
For irq spreading we want to store affinity masks in the msi_entry. Add the infrastructure for it. We allocate an array of cpumasks with an array size of the number of used vectors in the entry, so we can hand in the information per linux interrupt later. As we hand in the number of used vectors, we assign them right away. Convert all the call sites. Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: Christoph Hellwig <[email protected]> Link: http://lkml.kernel.org/r/[email protected]
2016-09-14  Merge branches 'doc.2016.08.22c', 'exp.2016.08.22c', 'fixes.2016.09.14a', 'hotplug.2016.08.22c' and 'torture.2016.08.22c' into HEAD  (Paul E. McKenney)  [11 files, -120/+164]
doc.2016.08.22c: Documentation updates
exp.2016.08.22c: Expedited grace-period updates
fixes.2016.09.14a: Miscellaneous fixes
hotplug.2016.08.22c: CPU-hotplug changes
torture.2016.08.22c: Torture-test changes
2016-09-14  x86/signal: Add SA_{X32,IA32}_ABI sa_flags  (Dmitry Safonov)  [1 file, -0/+7]
Introduce new flags that defines which ABI to use on creating sigframe. Those flags kernel will set according to sigaction syscall ABI, which set handler for the signal being delivered. So that will drop the dependency on TIF_IA32/TIF_X32 flags on signal deliver. Those flags will be used only under CONFIG_COMPAT. Similar way ARM uses sa_flags to differ in which mode deliver signal for 26-bit applications (look at SA_THIRYTWO). Signed-off-by: Dmitry Safonov <[email protected]> Reviewed-by: Andy Lutomirski <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-09-14  Merge tag 'irqchip-4.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core  (Thomas Gleixner)  [129 files, -3937/+6151]
Merge the first drop of irqchip updates for 4.9 from Marc Zyngier:
- ACPI IORT core code
- IORT support for the GICv3 ITS
- A few GIC cleanups
2016-09-14  genirq: Expose interrupt information through sysfs  (Craig Gallek)  [1 file, -2/+191]
Information about interrupts is exposed via /proc/interrupts, but the format of that file has changed over kernel versions and differs across architectures. It also has varying column numbers depending on hardware. That all makes it hard for tools to parse. To solve this, expose the information through sysfs so each irq attribute is in a separate file in a consistent, machine parsable way. This feature is only available when both CONFIG_SPARSE_IRQ and CONFIG_SYSFS are enabled. Examples: /sys/kernel/irq/18/actions: i801_smbus,ehci_hcd:usb1,uhci_hcd:usb7 /sys/kernel/irq/18/chip_name: IR-IO-APIC /sys/kernel/irq/18/hwirq: 18 /sys/kernel/irq/18/name: fasteoi /sys/kernel/irq/18/per_cpu_count: 0,0 /sys/kernel/irq/18/type: level /sys/kernel/irq/25/actions: ahci0 /sys/kernel/irq/25/chip_name: IR-PCI-MSI /sys/kernel/irq/25/hwirq: 512000 /sys/kernel/irq/25/name: edge /sys/kernel/irq/25/per_cpu_count: 29036,0 /sys/kernel/irq/25/type: edge [ tglx: Moved kobject_del() under sparse_irq_lock, massaged code comments and changelog ] Signed-off-by: Craig Gallek <[email protected]> Cc: David Decotigny <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-09-13  cpufreq: schedutil: Add iowait boosting  (Rafael J. Wysocki)  [1 file, -4/+49]
Modify the schedutil cpufreq governor to boost the CPU frequency if the SCHED_CPUFREQ_IOWAIT flag is passed to it via cpufreq_update_util(). If that happens, the frequency is set to the maximum during the first update after receiving the SCHED_CPUFREQ_IOWAIT flag and then the boost is reduced by half during each following update. Signed-off-by: Rafael J. Wysocki <[email protected]> Looks-good-to: Steve Muckle <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]>
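A sketch of the boost/decay behaviour described above (an illustrative helper, not the exact schedutil code; the function name and the way the boost is folded into the utilization are assumptions):

    static unsigned long iowait_boost_example(unsigned long util,
                                              unsigned long max,
                                              unsigned long *boost,
                                              unsigned int flags)
    {
        if (flags & SCHED_CPUFREQ_IOWAIT)
            *boost = max;       /* first update after the iowait wakeup */
        else if (*boost)
            *boost >>= 1;       /* reduce the boost by half on each later update */

        return util > *boost ? util : *boost;
    }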
2016-09-13  cpufreq / sched: SCHED_CPUFREQ_IOWAIT flag to indicate iowait condition  (Rafael J. Wysocki)  [1 file, -0/+8]
Testing indicates that it is possible to improve performance significantly without increasing energy consumption too much by teaching cpufreq governors to bump up the CPU performance level if the in_iowait flag is set for the task in enqueue_task_fair(). For this purpose, define a new cpufreq_update_util() flag SCHED_CPUFREQ_IOWAIT and modify enqueue_task_fair() to pass that flag to cpufreq_update_util() in the in_iowait case. That generally requires cpufreq_update_util() to be called directly from there, because update_load_avg() may not be invoked in that case. Signed-off-by: Rafael J. Wysocki <[email protected]> Looks-good-to: Steve Muckle <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]>
2016-09-13  Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)  [1 file, -0/+22]
Pull scheduler fix from Ingo Molnar: "A try_to_wake_up() memory ordering race fix causing a busy-loop in ttwu()"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Fix a race between try_to_wake_up() and a woken up task
2016-09-13  Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)  [2 files, -14/+48]
Pull perf fixes from Ingo Molnar: "This contains:
- a set of fixes found by directed-random perf fuzzing efforts by Vince Weaver, Alexander Shishkin and Peter Zijlstra
- a cqm driver crash fix
- an AMD uncore driver use after free fix"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel: Fix PEBSv3 record drain
  perf/x86/intel/bts: Kill a silly warning
  perf/x86/intel/bts: Fix BTS PMI detection
  perf/x86/intel/bts: Fix confused ordering of PMU callbacks
  perf/core: Fix aux_mmap_count vs aux_refcount order
  perf/core: Fix a race between mmap_close() and set_output() of AUX events
  perf/x86/amd/uncore: Prevent use after free
  perf/x86/intel/cqm: Check cqm/mbm enabled state in event init
  perf/core: Remove WARN from perf_event_read()
2016-09-13  tick/nohz: Prevent stopping the tick on an offline CPU  (Wanpeng Li)  [1 file, -2/+5]
can_stop_full_tick() has no check for offline cpus. So it allows the tick to be stopped on an offline cpu from the interrupt return path, which is wrong and subsequently makes irq_work_needs_cpu() warn about being called for an offline cpu. Commit f7ea0fd639c2c4 ("tick: Don't invoke tick_nohz_stop_sched_tick() if the cpu is offline") added prevention for can_stop_idle_tick(), but forgot to do the same in can_stop_full_tick(). Add it. [ tglx: Massaged changelog ] Signed-off-by: Wanpeng Li <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Frederic Weisbecker <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-09-13  cpuset: handle race between CPU hotplug and cpuset_hotplug_work  (Joonwoo Park)  [1 file, -3/+14]
A discrepancy between cpu_online_mask and cpuset's effective_cpus mask is inevitable during hotplug since cpuset defers updating of effective_cpus mask using a workqueue, during which time nothing prevents the system from more hotplug operations. For that reason guarantee_online_cpus() walks up the cpuset hierarchy until it finds an intersection under the assumption that top cpuset's effective_cpus mask intersects with cpu_online_mask even with such a race occurring. However a sequence of CPU hotplugs can open a time window, during which none of the effective CPUs in the top cpuset intersect with cpu_online_mask. For example when there are 4 possible CPUs 0-3 and only CPU0 is online:

  ========================  ===========================
   cpu_online_mask           top_cpuset.effective_cpus
  ========================  ===========================
   echo 1 > cpu2/online.
   CPU hotplug notifier woke up hotplug work but not yet scheduled.
   [0,2]                     [0]

   echo 0 > cpu0/online.
   The workqueue is still runnable.
   [2]                       [0]
  ========================  ===========================

Now there is no intersection between cpu_online_mask and top_cpuset.effective_cpus. Thus invoking sys_sched_setaffinity() at this moment can cause following:

  Unable to handle kernel NULL pointer dereference at virtual address 000000d0
  ------------[ cut here ]------------
  Kernel BUG at ffffffc0001389b0 [verbose debug info unavailable]
  Internal error: Oops - BUG: 96000005 [#1] PREEMPT SMP
  Modules linked in:
  CPU: 2 PID: 1420 Comm: taskset Tainted: G W 4.4.8+ #98
  task: ffffffc06a5c4880 ti: ffffffc06e124000 task.ti: ffffffc06e124000
  PC is at guarantee_online_cpus+0x2c/0x58
  LR is at cpuset_cpus_allowed+0x4c/0x6c
  <snip>
  Process taskset (pid: 1420, stack limit = 0xffffffc06e124020)
  Call trace:
  [<ffffffc0001389b0>] guarantee_online_cpus+0x2c/0x58
  [<ffffffc00013b208>] cpuset_cpus_allowed+0x4c/0x6c
  [<ffffffc0000d61f0>] sched_setaffinity+0xc0/0x1ac
  [<ffffffc0000d6374>] SyS_sched_setaffinity+0x98/0xac
  [<ffffffc000085cb0>] el0_svc_naked+0x24/0x28

The top cpuset's effective_cpus are guaranteed to be identical to cpu_online_mask eventually. Hence fall back to cpu_online_mask when there is no intersection between top cpuset's effective_cpus and cpu_online_mask. Signed-off-by: Joonwoo Park <[email protected]> Acked-by: Li Zefan <[email protected]> Cc: Tejun Heo <[email protected]> Cc: [email protected] Cc: [email protected] Cc: <[email protected]> # 3.17+ Signed-off-by: Tejun Heo <[email protected]>
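A simplified sketch of the fallback described above (not the exact cpuset.c hunk; parent_cs() and the locking around the walk are cpuset.c-internal details assumed here):

    static void guarantee_online_cpus_sketch(struct cpuset *cs,
                                             struct cpumask *pmask)
    {
        while (!cpumask_intersects(cs->effective_cpus, cpu_online_mask)) {
            cs = parent_cs(cs);
            if (!cs) {
                /* Race window hit: even the top cpuset has no
                 * intersection, so fall back to cpu_online_mask. */
                cpumask_copy(pmask, cpu_online_mask);
                return;
            }
        }
        cpumask_and(pmask, cs->effective_cpus, cpu_online_mask);
    }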
2016-09-13  x86/pkeys: Fix pkeys build breakage for some non-x86 arches  (Dave Hansen)  [1 file, -0/+5]
Guenter Roeck reported breakage on the h8300 and c6x architectures (among others) caused by the new memory protection keys syscalls. This patch does what Arnd suggested and adds them to kernel/sys_ni.c. Fixes: a60f7b69d92c ("generic syscalls: Wire up memory protection keys syscalls") Reported-and-tested-by: Guenter Roeck <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Acked-by: Arnd Bergmann <[email protected]> Cc: [email protected] Cc: Dave Hansen <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
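The fix amounts to stub entries in kernel/sys_ni.c so that architectures which do not implement protection keys still link; roughly the following, where the exact list of wired-up entries is an assumption:

    /* kernel/sys_ni.c: cond_syscall() makes these fall back to
     * sys_ni_syscall() on arches without a real implementation.
     * The exact set of entries here is an assumption. */
    cond_syscall(sys_pkey_mprotect);
    cond_syscall(sys_pkey_alloc);
    cond_syscall(sys_pkey_free);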
2016-09-13  PM / Hibernate: allow hibernation with PAGE_POISONING_ZERO  (Anisse Astier)  [3 files, -18/+27]
PAGE_POISONING_ZERO disables zeroing new pages on alloc; instead, they are poisoned (zeroed) as they become available. In the hibernate use case, free pages will appear in the system without being cleared, left there by the loading kernel. This patch will make sure free pages are cleared on resume when PAGE_POISONING_ZERO is enabled. We free the pages just after resume because we can't do it later: going through any device resume code might allocate some memory and invalidate the free pages bitmap. Thus we don't need to disable hibernation when PAGE_POISONING_ZERO is enabled. Signed-off-by: Anisse Astier <[email protected]> Reviewed-by: Kees Cook <[email protected]> Acked-by: Pavel Machek <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2016-09-13  PM / sleep: enable suspend-to-idle even without registered suspend_ops  (Sudeep Holla)  [2 files, -3/+12]
Suspend-to-idle (aka the "freeze" sleep state) is a system sleep state in which all of the processors enter the deepest possible idle state and wait for interrupts right after suspending all the devices. There is no hard requirement for a platform to support and register platform specific suspend_ops to enter suspend-to-idle/freeze state. Only deeper system sleep states like PM_SUSPEND_STANDBY and PM_SUSPEND_MEM rely on such low level support/implementation. Suspend-to-idle can be entered as long as all the devices can be suspended. This patch enables the support for suspend-to-idle even on systems that don't have any low level support for deeper system sleep states and/or don't register any platform specific suspend_ops. Signed-off-by: Sudeep Holla <[email protected]> Tested-by: Andy Gross <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2016-09-13  PM / sleep: Increase default DPM watchdog timeout to 120  (Chen Yu)  [1 file, -2/+2]
Recently we got a new report that the hard disk could not resume on time due to firmware issues, and the system hit a kernel panic because of a DPM watchdog timeout. So adjust the default timeout from 60 to 120 to survive on this platform, and make DPM_WATCHDOG depend on EXPERT. Link: https://bugzilla.kernel.org/show_bug.cgi?id=117971 Suggested-by: Pavel Machek <[email protected]> Suggested-by: Rafael J. Wysocki <[email protected]> Reported-by: Higuita <[email protected]> Signed-off-by: Chen Yu <[email protected]> Acked-by: Pavel Machek <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2016-09-12  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (David S. Miller)  [11 files, -38/+113]
Conflicts:
	drivers/net/ethernet/mediatek/mtk_eth_soc.c
	drivers/net/ethernet/qlogic/qed/qed_dcbx.c
	drivers/net/phy/Kconfig
All conflicts were cases of overlapping commits. Signed-off-by: David S. Miller <[email protected]>
2016-09-12  tracing: Have max_latency be defined for HWLAT_TRACER as well  (Steven Rostedt (Red Hat))  [2 files, -3/+5]
The hwlat tracer uses tr->max_latency, and if it's the only tracer enabled that uses it, the build will fail. Add max_latency and its file when the hwlat tracer is enabled. Link: http://lkml.kernel.org/r/[email protected] Reported-by: Randy Dunlap <[email protected]> Tested-by: Randy Dunlap <[email protected]> Acked-by: Randy Dunlap <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2016-09-10  Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm  (Linus Torvalds)  [1 file, -0/+9]
Pull libnvdimm fixes from Dan Williams: "nvdimm fixes for v4.8, two of them are tagged for -stable:
- Fix devm_memremap_pages() to use track_pfn_insert(). Otherwise, DAX pmd mappings end up with an uncached pgprot, and unusable performance for the device-dax interface. The device-dax interface appeared in 4.7 so this is tagged for -stable.
- Fix a couple VM_BUG_ON() checks in the show_smaps() path to understand DAX pmd entries. This fix is tagged for -stable.
- Fix a mis-merge of the nfit machine-check handler to flip the polarity of an if() to match the final version of the patch that Vishal sent for 4.8-rc1. Without this the nfit machine check handler never detects / inserts new 'badblocks' entries which applications use to identify lost portions of files.
- For test purposes, fix the nvdimm_clear_poison() path to operate on legacy / simulated nvdimm memory ranges. Without this fix a test can set badblocks, but never clear them on these ranges.
- Fix the range checking done by dax_dev_pmd_fault(). This is not tagged for -stable since this problem is mitigated by specifying aligned resources at device-dax setup time.
These patches have appeared in a next release over the past week. The recent rebase you can see in the timestamps was to drop an invalid fix as identified by the updated device-dax unit tests [1]. The -mm touches have an ack from Andrew"
[1]: "[ndctl PATCH 0/3] device-dax test for recent kernel bugs" https://lists.01.org/pipermail/linux-nvdimm/2016-September/006855.html
* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  libnvdimm: allow legacy (e820) pmem region to clear bad blocks
  nfit, mce: Fix SPA matching logic in MCE handler
  mm: fix cache mode of dax pmd mappings
  mm: fix show_smap() for zone_device-pmd ranges
  dax: fix mapping size check
2016-09-10  Merge branch 'perf/urgent' into perf/core, to pick up fixes  (Ingo Molnar)  [2 files, -10/+36]
Signed-off-by: Ingo Molnar <[email protected]>
2016-09-10  Revert "sched/fair: Make update_min_vruntime() more readable"  (Peter Zijlstra)  [1 file, -4/+7]
There's a bug in this commit: 97a7142f157a ("sched/fair: Make update_min_vruntime() more readable") ... when !rb_leftmost && curr we fail to advance min_vruntime. So revert it. Reported-by: Byungchul Park <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
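For context, the reverted-to behaviour is roughly the following (a sketch of the sched/fair.c logic, not a verbatim quote of the hunk): when the rbtree is empty and only curr is runnable, min_vruntime must still advance from curr's vruntime, which is the case the rework lost.

    static void update_min_vruntime(struct cfs_rq *cfs_rq)
    {
        u64 vruntime = cfs_rq->min_vruntime;

        if (cfs_rq->curr)
            vruntime = cfs_rq->curr->vruntime;  /* with no rb_leftmost, this
                                                   alone must drive the advance */

        if (cfs_rq->rb_leftmost) {
            struct sched_entity *se = rb_entry(cfs_rq->rb_leftmost,
                                               struct sched_entity, run_node);

            if (!cfs_rq->curr)
                vruntime = se->vruntime;
            else
                vruntime = min_vruntime(vruntime, se->vruntime);
        }

        /* ensure we never gain time by being placed backwards */
        cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
    }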
2016-09-10  perf/core: Fix aux_mmap_count vs aux_refcount order  (Alexander Shishkin)  [1 file, -4/+11]
The order of accesses to ring buffer's aux_mmap_count and aux_refcount has to be preserved across the users, namely perf_mmap_close() and perf_aux_output_begin(), otherwise the inversion can result in the latter holding the last reference to the aux buffer and subsequently free'ing it in atomic context, triggering a warning. > ------------[ cut here ]------------ > WARNING: CPU: 0 PID: 257 at kernel/events/ring_buffer.c:541 __rb_free_aux+0x11a/0x130 > CPU: 0 PID: 257 Comm: stopbug Not tainted 4.8.0-rc1+ #2596 > Call Trace: > [<ffffffff810f3e0b>] __warn+0xcb/0xf0 > [<ffffffff810f3f3d>] warn_slowpath_null+0x1d/0x20 > [<ffffffff8121182a>] __rb_free_aux+0x11a/0x130 > [<ffffffff812127a8>] rb_free_aux+0x18/0x20 > [<ffffffff81212913>] perf_aux_output_begin+0x163/0x1e0 > [<ffffffff8100c33a>] bts_event_start+0x3a/0xd0 > [<ffffffff8100c42d>] bts_event_add+0x5d/0x80 > [<ffffffff81203646>] event_sched_in.isra.104+0xf6/0x2f0 > [<ffffffff8120652e>] group_sched_in+0x6e/0x190 > [<ffffffff8120694e>] ctx_sched_in+0x2fe/0x5f0 > [<ffffffff81206ca0>] perf_event_sched_in+0x60/0x80 > [<ffffffff81206d1b>] ctx_resched+0x5b/0x90 > [<ffffffff81207281>] __perf_event_enable+0x1e1/0x240 > [<ffffffff81200639>] event_function+0xa9/0x180 > [<ffffffff81202000>] ? perf_cgroup_attach+0x70/0x70 > [<ffffffff8120203f>] remote_function+0x3f/0x50 > [<ffffffff811971f3>] flush_smp_call_function_queue+0x83/0x150 > [<ffffffff81197bd3>] generic_smp_call_function_single_interrupt+0x13/0x60 > [<ffffffff810a6477>] smp_call_function_single_interrupt+0x27/0x40 > [<ffffffff81a26ea9>] call_function_single_interrupt+0x89/0x90 > [<ffffffff81120056>] finish_task_switch+0xa6/0x210 > [<ffffffff81120017>] ? finish_task_switch+0x67/0x210 > [<ffffffff81a1e83d>] __schedule+0x3dd/0xb50 > [<ffffffff81a1efe5>] schedule+0x35/0x80 > [<ffffffff81128031>] sys_sched_yield+0x61/0x70 > [<ffffffff81a25be5>] entry_SYSCALL_64_fastpath+0x18/0xa8 > ---[ end trace 6235f556f5ea83a9 ]--- This patch puts the checks in perf_aux_output_begin() in the same order as that of perf_mmap_close(). Reported-by: Vince Weaver <[email protected]> Signed-off-by: Alexander Shishkin <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-09-10  perf/core: Fix a race between mmap_close() and set_output() of AUX events  (Alexander Shishkin)  [1 file, -6/+25]
In the mmap_close() path we need to stop all the AUX events that are writing data to the AUX area that we are unmapping, before we can safely free the pages. To determine if an event needs to be stopped, we're comparing its ->rb against the one that's getting unmapped. However, a SET_OUTPUT ioctl may turn up inside an AUX transaction and swizzle event::rb to some other ring buffer, but the transaction will keep writing data to the old ring buffer until the event gets scheduled out. At this point, mmap_close() will skip over such an event and will proceed to free the AUX area, while it's still being used by this event, which will set off a warning in the mmap_close() path and cause a memory corruption. To avoid this, always stop an AUX event before its ->rb is updated; this will release the (potentially) last reference on the AUX area of the buffer. If the event gets restarted, its new ring buffer will be used. If another SET_OUTPUT comes and switches it back to the old ring buffer that's getting unmapped, it's also fine: this ring buffer's aux_mmap_count will be zero and AUX transactions won't start any more. Reported-by: Vince Weaver <[email protected]> Signed-off-by: Alexander Shishkin <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-09-09  bpf: add BPF_CALL_x macros for declaring helpers  (Daniel Borkmann)  [4 files, -77/+51]
This work adds BPF_CALL_<n>() macros and converts all the eBPF helper functions to use them, in a similar fashion like we do with SYSCALL_DEFINE<n>() macros that are used today. Motivation for this is to hide all the register handling and all necessary casts from the user, so that it is done automatically in the background when adding a BPF_CALL_<n>() call. This makes current helpers easier to review, eases to write future helpers, avoids getting the casting mess wrong, and allows for extending all helpers at once (f.e. build time checks, etc). It also helps detecting more easily in code reviews that unused registers are not instrumented in the code by accident, breaking compatibility with existing programs. BPF_CALL_<n>() internals are quite similar to SYSCALL_DEFINE<n>() ones with some fundamental differences, for example, for generating the actual helper function that carries all u64 regs, we need to fill unused regs, so that we always end up with 5 u64 regs as an argument. I reviewed several 0-5 generated BPF_CALL_<n>() variants of the .i results and they look all as expected. No sparse issue spotted. We let this also sit for a few days with Fengguang's kbuild test robot, and there were no issues seen. On s390, it barked on the "uses dynamic stack allocation" notice, which is an old one from bpf_perf_event_output{,_tp}() reappearing here due to the conversion to the call wrapper, just telling that the perf raw record/frag sits on stack (gcc with s390's -mwarn-dynamicstack), but that's all. Did various runtime tests and they were fine as well. All eBPF helpers are now converted to use these macros, getting rid of a good chunk of all the raw castings. Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
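As an example of the macro shape, the map-lookup helper reads roughly as follows after the conversion (context simplified):

    /* Before: a raw u64 r1..r5 signature with manual casts. After: */
    BPF_CALL_2(bpf_map_lookup_elem, struct bpf_map *, map, void *, key)
    {
        WARN_ON_ONCE(!rcu_read_lock_held());
        return (unsigned long) map->ops->map_lookup_elem(map, key);
    }
    /* The macro expands to a 5-argument u64 wrapper plus a typed inner
     * function, so unused registers are filled in automatically. */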
2016-09-09  bpf: add BPF_SIZEOF and BPF_FIELD_SIZEOF macros  (Daniel Borkmann)  [1 file, -6/+6]
Add BPF_SIZEOF() and BPF_FIELD_SIZEOF() macros to improve the code a bit which otherwise often result in overly long bytes_to_bpf_size(sizeof()) and bytes_to_bpf_size(FIELD_SIZEOF()) lines. So place them into a macro helper instead. Moreover, we currently have a BUILD_BUG_ON(BPF_FIELD_SIZEOF()) check in convert_bpf_extensions(), but we should rather make that generic as well and add a BUILD_BUG_ON() test in all BPF_SIZEOF()/BPF_FIELD_SIZEOF() users to detect any rewriter size issues at compile time. Note, there are currently none, but we want to assert that it stays this way. Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
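A sketch of how the macros read at a call site in an instruction rewriter (illustrative; the surrounding context and the choice of skb field are assumptions, not taken from the patch):

    struct bpf_insn insn;

    /* Instead of open-coding
     * bytes_to_bpf_size(FIELD_SIZEOF(struct sk_buff, protocol)): */
    insn = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_buff, protocol),
                       BPF_REG_0, BPF_REG_6,
                       offsetof(struct sk_buff, protocol));
    /* BPF_SIZEOF()/BPF_FIELD_SIZEOF() also carry a BUILD_BUG_ON(), so an
     * invalid size is caught at compile time rather than at runtime. */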