path: root/kernel
Age | Commit message | Author | Files | Lines
2018-10-09 | locking/lockdep: Make class->ops a percpu counter and move it under CONFIG_DEBUG_LOCKDEP=y | Waiman Long | 3 | -4/+36
A sizable portion of the CPU cycles spent in __lock_acquire() is used up by the atomic increment of the class->ops stat counter. By taking it out of the lock_class structure and changing it to a per-cpu, per-lock-class counter, we can reduce the amount of cacheline contention on the class structure when multiple CPUs are trying to acquire locks of the same class simultaneously.

To limit the increase in memory consumption caused by the percpu nature of that counter, it is now put back under the CONFIG_DEBUG_LOCKDEP config option, so the memory consumption increase will only occur if CONFIG_DEBUG_LOCKDEP is defined. The lock_class structure, however, is reduced in size by 16 bytes on 64-bit archs after the ops removal and a minor restructuring of the fields.

This patch also fixes a bug in the increment code: the counter is of the 'unsigned long' type, but atomic_inc() was used to increment it.

Signed-off-by: Waiman Long <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
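A minimal sketch of the shape this change implies, assuming CONFIG_DEBUG_LOCKDEP=y (names follow the description above, not necessarily the exact diff):

    DEFINE_PER_CPU(unsigned long [MAX_LOCKDEP_KEYS], lock_class_ops);

    static inline void debug_class_ops_inc(struct lock_class *class)
    {
        int idx = class - lock_classes;

        /* Plain per-cpu increment: no atomic RMW and no cacheline
         * bouncing on the shared lock_class structure. */
        this_cpu_inc(lock_class_ops[idx]);
    }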
2018-10-08 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next | David S. Miller | 7 | -184/+767
Alexei Starovoitov says:

====================
pull-request: bpf-next 2018-10-08

The following pull-request contains BPF updates for your *net-next* tree. The main changes are:

1) sk_lookup_[tcp|udp] and sk_release helpers from Joe Stringer, which allow BPF programs to perform lookups for sockets in a network namespace. This would allow programs to determine early on in processing whether the stack is expecting to receive the packet, and perform some action (e.g. drop, forward somewhere) based on this information.

2) per-cpu cgroup local storage from Roman Gushchin. Per-cpu cgroup local storage is very similar to simple cgroup storage, except all the data is per-cpu. The main goal of the per-cpu variant is to implement super fast counters (e.g. packet counters), which require neither lookups nor atomic operations in the fast path. An example of these hybrid counters is in selftests/bpf/netcnt_prog.c.

3) allow HW offload of programs with BPF-to-BPF function calls, from Quentin Monnet.

4) support more than 64-byte key/value in HW offloaded BPF maps, from Jakub Kicinski.

5) rename of libbpf interfaces, from Andrey Ignatov. libbpf is maturing as a library and should follow good practices in library design and implementation to play well with other libraries. This patch set brings a consistent naming convention to global symbols.

6) relicense libbpf as LGPL-2.1 OR BSD-2-Clause, from Alexei Starovoitov, to let Apache2 projects use libbpf.

7) various AF_XDP fixes from Björn and Magnus.
====================

Signed-off-by: David S. Miller <[email protected]>
2018-10-09 | genirq: Fix grammar s/an /a / | Geert Uytterhoeven | 1 | -1/+1
Fix a grammar mistake in <linux/interrupt.h>. [ mingo: While at it, also fix a similar error in another comment. ] Signed-off-by: Geert Uytterhoeven <[email protected]> Cc: Jiri Kosina <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/20181008111726.26286-1-geert%[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-10-09 | dma-direct: document the zone selection logic | Christoph Hellwig | 1 | -1/+8
What we are doing here isn't quite obvious, so add a comment explaining it. Signed-off-by: Christoph Hellwig <[email protected]>
2018-10-08 | bpf: allow offload of programs with BPF-to-BPF function calls | Quentin Monnet | 1 | -7/+3
Now that there is at least one driver supporting BPF-to-BPF function calls, lift the restriction, in the verifier, on hardware offload of eBPF programs containing such calls. But prevent jit_subprogs(), still in the verifier, from being run for offloaded programs. Signed-off-by: Quentin Monnet <[email protected]> Reviewed-by: Jiong Wang <[email protected]> Reviewed-by: Jakub Kicinski <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-10-08 | bpf: add verifier callback to get stack usage info for offloaded progs | Quentin Monnet | 2 | -0/+21
In preparation for BPF-to-BPF calls in offloaded programs, add a new function attribute to the struct bpf_prog_offload_ops so that drivers supporting eBPF offload can hook at the end of program verification, and potentially extract information collected by the verifier. Implement a minimal callback (returning 0) in the drivers providing the structs, namely netdevsim and nfp. This will be useful in the nfp driver, in later commits, to extract the number of subprograms as well as the stack depth for those subprograms. Signed-off-by: Quentin Monnet <[email protected]> Reviewed-by: Jiong Wang <[email protected]> Reviewed-by: Jakub Kicinski <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
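To visualize the hook, the offload ops plausibly end up extended roughly like this (a sketch based on the description above, not the verbatim diff):

    struct bpf_prog_offload_ops {
        /* existing per-instruction verification hook */
        int (*insn_hook)(struct bpf_verifier_env *env,
                         int insn_idx, int prev_insn_idx);
        /* new: called at the end of program verification, so the
         * driver can pull out subprogram counts, stack depth, etc. */
        int (*finalize)(struct bpf_verifier_env *env);
    };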
2018-10-08 | dma-debug: Check for drivers mapping invalid addresses in dma_map_single() | Stephen Boyd | 1 | -0/+16
I recently debugged a DMA mapping oops where a driver was trying to map a buffer returned from request_firmware() with dma_map_single(). Memory returned from request_firmware() is mapped into the vmalloc region and this isn't a valid region to map with dma_map_single() per the DMA documentation's "What memory is DMA'able?" section. Unfortunately, we don't really check that in the DMA debugging code, so enabling DMA debugging doesn't help catch this problem. Let's add a new DMA debug function to check for a vmalloc address or an invalid virtual address and print a warning if this happens. This makes it a little easier to debug these sorts of problems, instead of seeing odd behavior or crashes when drivers attempt to map the vmalloc space for DMA. Cc: Marek Szyprowski <[email protected]> Reviewed-by: Robin Murphy <[email protected]> Signed-off-by: Stephen Boyd <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
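A minimal sketch of such a check (the real dma-debug code presumably routes through its own error-reporting helpers; dev_warn() here is illustrative):

    #include <linux/mm.h>    /* is_vmalloc_addr(), virt_addr_valid() */

    void debug_dma_map_single(struct device *dev, const void *addr,
                              unsigned long len)
    {
        /* vmalloc memory (e.g. request_firmware() buffers) has no
         * guaranteed linear mapping and is not DMA'able this way. */
        if (is_vmalloc_addr(addr))
            dev_warn(dev, "mapping vmalloc memory %p (len %lu)\n",
                     addr, len);
        else if (!virt_addr_valid(addr))
            dev_warn(dev, "mapping invalid virtual address %p (len %lu)\n",
                     addr, len);
    }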
2018-10-08 | signal: In sigqueueinfo prefer sig not si_signo | Eric W. Biederman | 1 | -57/+84
Andrei Vagin <[email protected]> reported:

> According to the man page, the user should not set si_signo; it has to be set
> by the kernel.
>
> $ man 2 rt_sigqueueinfo
>
>    The uinfo argument specifies the data to accompany the signal. This
>    argument is a pointer to a structure of type siginfo_t, described in
>    sigaction(2) (and defined by including <sigaction.h>). The caller
>    should set the following fields in this structure:
>
>    si_code
>        This must be one of the SI_* codes in the Linux kernel source
>        file include/asm-generic/siginfo.h, with the restriction that
>        the code must be negative (i.e., cannot be SI_USER, which is
>        used by the kernel to indicate a signal sent by kill(2)) and
>        cannot (since Linux 2.6.39) be SI_TKILL (which is used by the
>        kernel to indicate a signal sent using tgkill(2)).
>
>    si_pid This should be set to a process ID, typically the process ID of
>        the sender.
>
>    si_uid This should be set to a user ID, typically the real user ID of
>        the sender.
>
>    si_value
>        This field contains the user data to accompany the signal. For
>        more information, see the description of the last (union sigval)
>        argument of sigqueue(3).
>
>    Internally, the kernel sets the si_signo field to the value specified
>    in sig, so that the receiver of the signal can also obtain the signal
>    number via that field.
>
> On Tue, Sep 25, 2018 at 07:19:02PM +0200, Eric W. Biederman wrote:
>> If there is some application that calls sigqueueinfo directly that has
>> a problem with this added sanity check we can revisit this when we see
>> what kind of crazy that application is doing.
>
> I already know two "applications" ;)
>
> https://github.com/torvalds/linux/blob/master/tools/testing/selftests/ptrace/peeksiginfo.c
> https://github.com/checkpoint-restore/criu/blob/master/test/zdtm/static/sigpending.c
>
> Disclaimer: I'm the author of both of them.

Looking at the kernel code, the historical behavior has always been to prefer the signal number passed in by the kernel. So, sigh. Implement __copy_siginfo_from_user and __copy_siginfo_from_user32 to take that signal number and prefer it. Users of ptrace will still use copy_siginfo_from_user and copy_siginfo_from_user32, as they do not and never have had a signal number there.

Luckily this change has never made it farther than linux-next.

Fixes: e75dc036c445 ("signal: Fail sigqueueinfo if si_signo != sig") Reported-by: Andrei Vagin <[email protected]> Tested-by: Andrei Vagin <[email protected]> Signed-off-by: "Eric W. Biederman" <[email protected]>
2018-10-06 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net | David S. Miller | 8 | -29/+118
2018-10-06 | Merge branch 'core/core' into x86/build, to prevent conflicts | Ingo Molnar | 2 | -55/+54
Signed-off-by: Ingo Molnar <[email protected]>
2018-10-06 | kexec: Allocate decrypted control pages for kdump if SME is enabled | Lianbo Jiang | 1 | -0/+6
When SME is enabled in the first kernel, it needs to allocate decrypted pages for kdump because when the kdump kernel boots, these pages need to be accessed decrypted in the initial boot stage, before SME is enabled. [ bp: clean up text. ] Signed-off-by: Lianbo Jiang <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Tom Lendacky <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected]
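For orientation, the control-page allocation path plausibly ends up doing something like the following (a sketch; on x86 with SME active, arch_kexec_post_alloc_pages() would resolve to a set_memory_decrypted() call, and the surrounding bookkeeping is elided):

    static struct page *kimage_alloc_pages(gfp_t gfp_mask, unsigned int order)
    {
        struct page *pages;

        pages = alloc_pages(gfp_mask & ~__GFP_ZERO, order);
        if (pages) {
            unsigned int i, count = 1 << order;

            /* Make the control pages decrypted so the kdump kernel
             * can read them before it has enabled SME itself. */
            arch_kexec_post_alloc_pages(page_address(pages), count,
                                        gfp_mask);

            /* Zero only after the mapping is decrypted. */
            if (gfp_mask & __GFP_ZERO)
                for (i = 0; i < count; i++)
                    clear_highpage(pages + i);
        }
        return pages;
    }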
2018-10-06 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net | Greg Kroah-Hartman | 2 | -2/+13
Dave writes:

"Networking fixes:

1) Fix truncation of 32-bit right shift in bpf, from Jann Horn.
2) Fix memory leak in wireless wext compat, from Stefan Seyfried.
3) Use after free in cfg80211's reg_process_hint(), from Yu Zhao.
4) Need to cancel pending work when unbinding in smsc75xx, otherwise we oops, also from Yu Zhao.
5) Don't allow enslaving a team device to itself, from Ido Schimmel.
6) Fix backwards compat with older userspace for rtnetlink FDB dumps. From Mauricio Faria.
7) Add validation of tc policy netlink attributes, from David Ahern.
8) Fix RCU locking in rawv6_send_hdrinc(), from Wei Wang."

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits)
  net: mvpp2: Extract the correct ethtype from the skb for tx csum offload
  ipv6: take rcu lock in rawv6_send_hdrinc()
  net: sched: Add policy validation for tc attributes
  rtnetlink: fix rtnl_fdb_dump() for ndmsg header
  yam: fix a missing-check bug
  net: bpfilter: Fix type cast and pointer warnings
  net: cxgb3_main: fix a missing-check bug
  bpf: 32-bit RSH verification must truncate input before the ALU op
  net: phy: phylink: fix SFP interface autodetection
  be2net: don't flip hw_features when VXLANs are added/deleted
  net/packet: fix packet drop as of virtio gso
  net: dsa: b53: Keep CPU port as tagged in all VLANs
  openvswitch: load NAT helper
  bnxt_en: get the reduced max_irqs by the ones used by RDMA
  bnxt_en: free hwrm resources, if driver probe fails.
  bnxt_en: Fix enables field in HWRM_QUEUE_COS2BW_CFG request
  bnxt_en: Fix VNIC reservations on the PF.
  team: Forbid enslaving team device to itself
  net/usb: cancel pending work when unbinding smsc75xx
  mlxsw: spectrum: Delete RIF when VLAN device is removed
  ...
2018-10-05 | Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Greg Kroah-Hartman | 1 | -7/+4
Ingo writes:

"perf fixes:

- fix a CPU#0 hot unplug bug and a PCI enumeration bug in the x86 Intel uncore PMU driver
- fix a CPU event enumeration bug in the x86 AMD PMU driver
- fix a perf ring-buffer corruption bug when using tracepoints
- fix a PMU unregister locking bug"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/amd/uncore: Set ThreadMask and SliceMask for L3 Cache perf events
  perf/x86/intel/uncore: Fix PCI BDF address of M3UPI on SKX
  perf/ring_buffer: Prevent concurent ring buffer access
  perf/x86/intel/uncore: Use boot_cpu_data.phys_proc_id instead of hardcorded physical package ID 0
  perf/core: Fix perf_pmu_unregister() locking
2018-10-05 | Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Greg Kroah-Hartman | 4 | -16/+95
Ingo writes:

"scheduler fixes:

These fixes address a rather involved performance regression between v4.17 and v4.19 in the sched/numa auto-balancing code. Since distros really need this fix we accelerated it to sched/urgent for a faster upstream merge.

NUMA scheduling and balancing performance is now largely back to v4.17 levels, without reintroducing the NUMA placement bugs that v4.18 and v4.19 fixed.

Many thanks to Srikar Dronamraju, Mel Gorman and Jirka Hladky, for reporting, testing, re-testing and solving this rather complex set of bugs."

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/numa: Migrate pages to local nodes quicker early in the lifetime of a task
  mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration
  sched/numa: Avoid task migration for small NUMA improvement
  mm/migrate: Use spin_trylock() while resetting rate limit
  sched/numa: Limit the conditions where scan period is reset
  sched/numa: Reset scan rate whenever task moves across nodes
  sched/numa: Pass destination CPU as a parameter to migrate_task_rq
  sched/numa: Stop multiple tasks from moving to the CPU at the same time
2018-10-05 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf | David S. Miller | 2 | -2/+13
Daniel Borkmann says:

====================
pull-request: bpf 2018-10-05

The following pull-request contains BPF updates for your *net* tree. The main changes are:

1) Fix to truncate input on ALU operations in 32 bit mode, from Jann.

2) Fixes for cgroup local storage to reject reserved flags on element update and rejection of map allocation with zero-sized value, from Roman.
====================

Signed-off-by: David S. Miller <[email protected]>
2018-10-05 | bpf: 32-bit RSH verification must truncate input before the ALU op | Jann Horn | 1 | -1/+9
When I wrote commit 468f6eafa6c4 ("bpf: fix 32-bit ALU op verification"), I assumed that, in order to emulate 64-bit arithmetic with 32-bit logic, it is sufficient to just truncate the output to 32 bits; and so I just moved the register size coercion that used to be at the start of the function to the end of the function. That assumption is true for almost every op, but not for 32-bit right shifts, because those can propagate information towards the least significant bit. Fix it by always truncating inputs for 32-bit ops to 32 bits. Also get rid of the coerce_reg_to_size() after the ALU op, since that has no effect. Fixes: 468f6eafa6c4 ("bpf: fix 32-bit ALU op verification") Acked-by: Daniel Borkmann <[email protected]> Signed-off-by: Jann Horn <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
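In verifier terms, the fix plausibly amounts to coercing both inputs rather than only the output (a sketch of the idea; coerce_reg_to_size() is the existing helper named above):

    /* 32-bit ALU ops are (32,32)->32: truncate the *inputs* to 32 bits
     * before the op. Truncating only the output afterwards is unsound
     * for RSH, which moves stale upper bits toward the LSB. */
    if (BPF_CLASS(insn->code) != BPF_ALU64) {
        coerce_reg_to_size(dst_reg, 4);
        coerce_reg_to_size(&src_reg, 4);
    }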
2018-10-05 | printk: Add KBUILD_MODNAME and remove a redundant print prefix | He Zhe | 1 | -1/+3
Add KBUILD_MODNAME to make prints more clear. Link: http://lkml.kernel.org/r/[email protected] Cc: [email protected] Cc: [email protected] Signed-off-by: He Zhe <[email protected]> Reviewed-by: Sergey Senozhatsky <[email protected]> Signed-off-by: Petr Mladek <[email protected]>
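The usual kernel idiom this refers to, as a sketch (the define must appear before any include that pulls in printk.h):

    #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

    /* pr_info("foo\n") in kernel/printk/printk.c then prints
     * "printk: foo", making hand-written "printk: " prefixes
     * in individual messages redundant. */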
2018-10-05 | printk: Correct wrong casting | He Zhe | 1 | -2/+3
log_first_seq and console_seq are 64-bit unsigned integers. Correct a wrong casting that might cut off the output. Link: http://lkml.kernel.org/r/[email protected] Cc: [email protected] Cc: [email protected] Signed-off-by: He Zhe <[email protected]> [[email protected]: More descriptive commit message] Reviewed-by: Sergey Senozhatsky <[email protected]> Signed-off-by: Petr Mladek <[email protected]>
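The gist of the fix, sketched (not the verbatim diff): both sequence counters are u64, so print the difference with %llu instead of casting it down to a 32-bit type.

    /* before: "** %u printk messages dropped **" with
     * (unsigned)(log_first_seq - console_seq), which can truncate */
    len = sprintf(text, "** %llu printk messages dropped **\n",
                  log_first_seq - console_seq);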
2018-10-05 | printk: Fix panic caused by passing log_buf_len to command line | He Zhe | 1 | -1/+6
log_buf_len_setup does not check its input argument before passing it to simple_strtoull. The argument will be a NULL pointer if "log_buf_len", without a value, is set on the command line, which causes the following panic.

  PANIC: early exception 0xe3 IP 10:ffffffffaaeacd0d error 0 cr2 0x0
  [    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.19.0-rc4-yocto-standard+ #1
  [    0.000000] RIP: 0010:_parse_integer_fixup_radix+0xd/0x70
  ...
  [    0.000000] Call Trace:
  [    0.000000]  simple_strtoull+0x29/0x70
  [    0.000000]  memparse+0x26/0x90
  [    0.000000]  log_buf_len_setup+0x17/0x22
  [    0.000000]  do_early_param+0x57/0x8e
  [    0.000000]  parse_args+0x208/0x320
  [    0.000000]  ? rdinit_setup+0x30/0x30
  [    0.000000]  parse_early_options+0x29/0x2d
  [    0.000000]  ? rdinit_setup+0x30/0x30
  [    0.000000]  parse_early_param+0x36/0x4d
  [    0.000000]  setup_arch+0x336/0x99e
  [    0.000000]  start_kernel+0x6f/0x4ee
  [    0.000000]  x86_64_start_reservations+0x24/0x26
  [    0.000000]  x86_64_start_kernel+0x6f/0x72
  [    0.000000]  secondary_startup_64+0xa4/0xb0

This patch adds a check to prevent the panic.

Link: http://lkml.kernel.org/r/[email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Signed-off-by: He Zhe <[email protected]> Reviewed-by: Sergey Senozhatsky <[email protected]> Signed-off-by: Petr Mladek <[email protected]>
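The shape of the fix, sketched (log_buf_len_update() stands in for the existing size-update logic and is an assumption):

    static int __init log_buf_len_setup(char *str)
    {
        u64 size;

        /* "log_buf_len" with no "=value" hands us a NULL str; bail out
         * instead of passing NULL into memparse()/simple_strtoull(). */
        if (!str)
            return -EINVAL;

        size = memparse(str, &str);
        log_buf_len_update(size);
        return 0;
    }
    early_param("log_buf_len", log_buf_len_setup);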
2018-10-05 | cpu/SMT: State SMT is disabled even with nosmt and without "=force" | Borislav Petkov | 1 | -0/+1
When booting with "nosmt=force" a message is issued into dmesg to confirm that SMT has been force-disabled but such a message is not issued when only "nosmt" is on the kernel command line. Fix that. Signed-off-by: Borislav Petkov <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-10-05 | dma-direct: fix return value of dma_direct_supported | Alexander Duyck | 1 | -3/+1
It appears that in commit 9d7a224b463e ("dma-direct: always allow dma mask <= physiscal memory size") the logic of the test was changed from a "<" to a ">=" however I don't see any reason for that change. I am assuming that there was some additional change planned, specifically I suspect the logic was intended to be reversed and possibly used for a return. Since that is the case I have gone ahead and done that. This addresses issues I had on my system that prevented me from booting with the above mentioned commit applied on an x86_64 system w/ Intel IOMMU. Fixes: 9d7a224b463e ("dma-direct: always allow dma mask <= physiscal memory size") Signed-off-by: Alexander Duyck <[email protected]> Acked-by: Robin Murphy <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
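Going by that reasoning, the tail of dma_direct_supported() presumably becomes something like (a sketch, not the verbatim diff):

    /* The device is supported iff its mask covers the smallest mask
     * we must be able to allocate from, not the other way around. */
    return mask >= phys_to_dma(dev, min_mask);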
2018-10-04 | clocksource: Provide clocksource_arch_init() | Thomas Gleixner | 2 | -0/+6
Architectures have extra archdata in the clocksource, e.g. for VDSO support. There are no sanity checks or general initializations for this available. Add support for that. Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Acked-by: John Stultz <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Matt Rickard <[email protected]> Cc: Stephen Boyd <[email protected]> Cc: Florian Weimer <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: [email protected] Cc: [email protected] Cc: Paolo Bonzini <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Juergen Gross <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
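A minimal sketch of such a hook (the weak default; the exact call site is an assumption based on the description):

    /* Default no-op; architectures with clocksource archdata (e.g. for
     * VDSO support) override this to sanity-check and initialize it. */
    void __weak clocksource_arch_init(struct clocksource *cs)
    {
    }

The core would then invoke clocksource_arch_init(cs) once per clocksource from the registration path, plausibly __clocksource_register_scale().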
2018-10-04 | cgroup: Fix dom_cgrp propagation when enabling threaded mode | Tejun Heo | 1 | -9/+16
A cgroup which is already a threaded domain may be converted into a threaded cgroup if the prerequisite conditions are met. When this happens, all threaded descendants should also have their ->dom_cgrp updated to the new threaded domain cgroup. Unfortunately, this propagation was missing, leading to the following failure.

  # cd /sys/fs/cgroup/unified
  # cat cgroup.subtree_control    # show that no controllers are enabled
  # mkdir -p mycgrp/a/b/c
  # echo threaded > mycgrp/a/b/cgroup.type

At this point, the hierarchy looks as follows:

  mycgrp [d]
    a [dt]
      b [t]
        c [inv]

Now let's make node "a" threaded (and thus "mycgrp" is made "domain threaded"):

  # echo threaded > mycgrp/a/cgroup.type

By this point, we now have a hierarchy that looks as follows:

  mycgrp [dt]
    a [t]
      b [t]
        c [inv]

But when we try to convert the node "c" from "domain invalid" to "threaded", we get ENOTSUP on the write():

  # echo threaded > mycgrp/a/b/c/cgroup.type
  sh: echo: write error: Operation not supported

This patch fixes the problem by:

* Moving the open-coded ->dom_cgrp save and restoration in cgroup_enable_threaded() into cgroup_{save|restore}_control() so that multiple cgroups can be handled.
* Updating all threaded descendants' ->dom_cgrp to point to the new dom_cgrp when enabling threaded mode.

Signed-off-by: Tejun Heo <[email protected]> Reported-and-tested-by: "Michael Kerrisk (man-pages)" <[email protected]> Reported-by: Amin Jamali <[email protected]> Reported-by: Joao De Almeida Pereira <[email protected]> Link: https://lore.kernel.org/r/CAKgNAkhHYCMn74TCNiMJ=ccLd7DcmXSbvw3CbZ1YREeG7iJM5g@mail.gmail.com Fixes: 454000adaa2a ("cgroup: introduce cgroup->dom_cgrp and threaded css_set handling") Cc: [email protected] # v4.14+
2018-10-04 | sched/core: Fix comment regarding nr_iowait_cpu() and get_iowait_load() | Rafael J. Wysocki | 1 | -4/+4
The comment related to nr_iowait_cpu() and get_iowait_load() confuses cpufreq with cpuidle and is not very useful for this reason, so fix it. Signed-off-by: Rafael J. Wysocki <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Linux PM <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Fixes: e33a9bba85a8 "sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler" Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-10-03 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net | David S. Miller | 2 | -0/+9
Minor conflict in net/core/rtnetlink.c: David Ahern's bug fix in 'net' overlapped the renaming of a netlink attribute in net-next. Signed-off-by: David S. Miller <[email protected]>
2018-10-03 | signal: Use a smaller struct siginfo in the kernel | Eric W. Biederman | 1 | -18/+64
We reserve 128 bytes for struct siginfo but only use about 48 bytes on 64-bit and 32 bytes on 32-bit. Someday we might use more, but it is unlikely to be anytime soon.

Userspace seems content with just enough bytes of siginfo to implement sigqueue, or, in the case of checkpoint/restart, reinjecting signals the kernel has sent.

Reducing the stack footprint and the work to copy siginfo around from two cachelines to one seems worth doing, even if I don't have benchmarks to show a performance difference.

Suggested-by: Linus Torvalds <[email protected]> Signed-off-by: "Eric W. Biederman" <[email protected]>
2018-10-03 | signal: Distinguish between kernel_siginfo and siginfo | Eric W. Biederman | 4 | -61/+108
Linus recently observed that if we did not worry about the padding member in struct siginfo it is only about 48 bytes, and 48 bytes is much nicer than 128 bytes for allocating on the stack and copying around in the kernel. The obvious thing of only adding the padding when userspace is including siginfo.h won't work as there are sigframe definitions in the kernel that embed struct siginfo. So split siginfo in two; kernel_siginfo and siginfo. Keeping the traditional name for the userspace definition. While the version that is used internally to the kernel and ultimately will not be padded to 128 bytes is called kernel_siginfo. The definition of struct kernel_siginfo I have put in include/signal_types.h A set of buildtime checks has been added to verify the two structures have the same field offsets. To make it easy to verify the change kernel_siginfo retains the same size as siginfo. The reduction in size comes in a following change. Signed-off-by: "Eric W. Biederman" <[email protected]>
2018-10-03 | signal: Introduce copy_siginfo_from_user and use its return value | Eric W. Biederman | 2 | -16/+21
In preparation for using a smaller version of siginfo in the kernel, introduce copy_siginfo_from_user and use it when siginfo is copied from userspace. Make callers of copy_siginfo_from_user and copy_siginfo_from_user32 capture the return value and return that value on error. This is a necessary prerequisite for using a smaller siginfo in the kernel than the kernel exports to userspace. Signed-off-by: "Eric W. Biederman" <[email protected]>
2018-10-03 | signal: Remove the need for __ARCH_SI_PREAMBLE_SIZE and SI_PAD_SIZE | Eric W. Biederman | 1 | -3/+0
Rework the definition of struct siginfo so that the array padding struct siginfo to SI_MAX_SIZE can be placed in a union alongside the rest of the struct siginfo members. The result is that we no longer need the __ARCH_SI_PREAMBLE_SIZE or SI_PAD_SIZE definitions. Signed-off-by: "Eric W. Biederman" <[email protected]>
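The resulting layout is plausibly along these lines (a sketch of the union trick, not the verbatim header; _sifields names the existing union of per-signal members):

    typedef struct siginfo {
        union {
            struct {
                int si_signo;
                int si_errno;
                int si_code;
                union __sifields _sifields;
            };
            int _si_pad[SI_MAX_SIZE / sizeof(int)]; /* pad to 128 bytes */
        };
    } siginfo_t;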
2018-10-03 | signal: Fail sigqueueinfo if si_signo != sig | Eric W. Biederman | 1 | -2/+4
The kernel needs to validate that the contents of struct siginfo make sense as siginfo is copied into the kernel, so that the proper union members can be put in the appropriate locations. The field si_signo is a fundamental part of that validation. As such, changing the contents of si_signo after the validation makes no sense and can result in nonsense values in the kernel. So simply fail if someone is silly enough to set si_signo out of sync with the signal number passed to sigqueueinfo.

I don't expect a problem, as glibc's sigqueue implementation sets "si_signo = sig" and CRIU just returns to the kernel what the kernel gave to it. If there is some application that calls sigqueueinfo directly that has a problem with this added sanity check, we can revisit this when we see what kind of crazy that application is doing.

Signed-off-by: "Eric W. Biederman" <[email protected]>
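The essence of the added check, as a sketch:

    /* in do_rt_sigqueueinfo(), after copying the siginfo in: */
    if (info.si_signo != sig)
        return -EINVAL;

(Note the follow-up above, "signal: In sigqueueinfo prefer sig not si_signo", replaces this rejection with overwriting si_signo, because existing users broke.)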
2018-10-03 | signal/sparc: Move EMT_TAGOVF into the generic siginfo.h | Eric W. Biederman | 1 | -1/+1
When moving all of the architectures specific si_codes into siginfo.h, I apparently overlooked EMT_TAGOVF. Move it now. Remove the now redundant test in siginfo_layout for SIGEMT as now NSIGEMT is always defined. Signed-off-by: "Eric W. Biederman" <[email protected]>
2018-10-03 | locking/ww_mutex: Fix runtime warning in the WW mutex selftest | Guenter Roeck | 1 | -4/+6
If CONFIG_WW_MUTEX_SELFTEST=y is enabled, booting an image in an arm64 virtual machine results in the following traceback if 8 CPUs are enabled:

  DEBUG_LOCKS_WARN_ON(__owner_task(owner) != current)
  WARNING: CPU: 2 PID: 537 at kernel/locking/mutex.c:1033 __mutex_unlock_slowpath+0x1a8/0x2e0
  ...
  Call trace:
   __mutex_unlock_slowpath()
   ww_mutex_unlock()
   test_cycle_work()
   process_one_work()
   worker_thread()
   kthread()
   ret_from_fork()

If requesting b_mutex fails with -EDEADLK, the error variable is reassigned to the return value from calling ww_mutex_lock on a_mutex again. If this call fails, a_mutex is not locked. It is, however, unconditionally unlocked subsequently, causing the reported warning. Fix the problem by using two error variables.

With this change, the selftest still fails as follows:

  cyclic deadlock not resolved, ret[7/8] = -35

However, the traceback is gone.

Signed-off-by: Guenter Roeck <[email protected]> Cc: Chris Wilson <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Fixes: d1b42b800e5d0 ("locking/ww_mutex: Add kselftests for resolving ww_mutex cyclic deadlocks") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
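A sketch of the two-error-variable pattern in the selftest's cycle worker (names follow the ww_mutex API; not the verbatim diff):

    int err, erra = 0;

    err = ww_mutex_lock(&cycle->b_mutex, &ctx);
    if (err == -EDEADLK) {
        err = 0;    /* lock_slow below acquires b_mutex unconditionally */
        ww_mutex_unlock(&cycle->a_mutex);
        ww_mutex_lock_slow(&cycle->b_mutex, &ctx);
        erra = ww_mutex_lock(&cycle->a_mutex, &ctx);
    }

    if (!err)
        ww_mutex_unlock(&cycle->b_mutex);
    if (!erra)    /* only unlock a_mutex if we actually hold it */
        ww_mutex_unlock(&cycle->a_mutex);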
2018-10-03 | locking/lockdep: Add a faster path in __lock_release() | Waiman Long | 1 | -3/+14
When __lock_release() is called, the most likely unlock scenario is on the innermost lock in the chain. In this case, we can skip some of the checks and provide a faster path to completion. Signed-off-by: Waiman Long <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-10-03 | locking/lockdep: Eliminate redundant IRQs check in __lock_acquire() | Waiman Long | 1 | -8/+7
The static __lock_acquire() function has only two callers: 1) lock_acquire() 2) reacquire_held_locks() In lock_acquire(), raw_local_irq_save() is called beforehand. So IRQs must have been disabled. So the check: DEBUG_LOCKS_WARN_ON(!irqs_disabled()) is kind of redundant in this case. So move the above check to reacquire_held_locks() to eliminate redundant code in the lock_acquire() path. Signed-off-by: Waiman Long <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-10-03 | locking/lockdep: Remove add_chain_cache_classes() | Waiman Long | 1 | -70/+0
The inline function add_chain_cache_classes() is defined, but has no caller. Just remove it. Signed-off-by: Waiman Long <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-10-03 | bpf: Add helper to retrieve socket in BPF | Joe Stringer | 1 | -1/+7
This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and bpf_sk_lookup_udp(), which allow BPF programs to find out if there is a socket listening on this host, and return a socket pointer which the BPF program can then access to determine, for instance, whether to forward or drop traffic. bpf_sk_lookup_xxx() may take a reference on the socket, so when a BPF program makes use of this function, it must subsequently pass the returned pointer into the newly added bpf_sk_release() to return the reference.

By way of example, the following pseudocode would filter inbound connections at XDP if there is no corresponding service listening for the traffic:

  struct bpf_sock_tuple tuple;
  struct bpf_sock_ops *sk;

  populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
  sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
  if (!sk) {
      // Couldn't find a socket listening for this traffic. Drop.
      return TC_ACT_SHOT;
  }
  bpf_sk_release(sk, 0);
  return TC_ACT_OK;

Signed-off-by: Joe Stringer <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-10-03 | bpf: Add reference tracking to verifier | Joe Stringer | 1 | -19/+287
Allow helper functions to acquire a reference and return it into a register. Specific pointer types such as the PTR_TO_SOCKET will implicitly represent such a reference. The verifier must ensure that these references are released exactly once in each path through the program. To achieve this, this commit assigns an id to the pointer and tracks it in the 'bpf_func_state', then when the function or program exits, verifies that all of the acquired references have been freed. When the pointer is passed to a function that frees the reference, it is removed from the 'bpf_func_state` and all existing copies of the pointer in registers are marked invalid. Signed-off-by: Joe Stringer <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-10-03 | bpf: Macrofy stack state copy | Joe Stringer | 1 | -46/+60
An upcoming commit will need very similar copy/realloc boilerplate, so refactor the existing stack copy/realloc functions into macros to simplify it. Signed-off-by: Joe Stringer <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-10-03 | bpf: Add PTR_TO_SOCKET verifier type | Joe Stringer | 1 | -14/+106
Teach the verifier a little bit about a new type of pointer, a PTR_TO_SOCKET. This pointer type is accessed from BPF through the 'struct bpf_sock' structure. Signed-off-by: Joe Stringer <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-10-03 | bpf: Generalize ptr_or_null regs check | Joe Stringer | 1 | -18/+25
This check will be reused by an upcoming commit for conditional jump checks for sockets. Refactor it a bit to simplify the later commit. Signed-off-by: Joe Stringer <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-10-03 | bpf: Reuse canonical string formatter for ctx errs | Joe Stringer | 1 | -4/+3
The array "reg_type_str" provides canonical formatting of register types, however a couple of places would previously check whether a register represented the context and write the name "context" directly. An upcoming commit will add another pointer type to these statements, so to provide more accurate error messages in the verifier, update these error messages to use "reg_type_str" instead. Signed-off-by: Joe Stringer <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-10-03 | bpf: Simplify ptr_min_max_vals adjustment | Joe Stringer | 1 | -12/+10
An upcoming commit will add another two pointer types that need very similar behaviour, so generalise this function now. Signed-off-by: Joe Stringer <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-10-03 | bpf: Add iterator for spilled registers | Joe Stringer | 1 | -9/+7
Add this iterator for spilled registers, it concentrates the details of how to get the current frame's spilled registers into a single macro while clarifying the intention of the code which is calling the macro. Signed-off-by: Joe Stringer <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
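The macro pair plausibly looks something like this (a sketch inferred from the description; field names follow the verifier's per-frame state):

    #define bpf_get_spilled_reg(slot, frame)                          \
        ((slot) < (frame)->allocated_stack / BPF_REG_SIZE &&          \
         (frame)->stack[slot].slot_type[0] == STACK_SPILL             \
         ? &(frame)->stack[slot].spilled_ptr : NULL)

    #define bpf_for_each_spilled_reg(iter, frame, reg)                \
        for (iter = 0, reg = bpf_get_spilled_reg(iter, frame);        \
             iter < (frame)->allocated_stack / BPF_REG_SIZE;          \
             iter++, reg = bpf_get_spilled_reg(iter, frame))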
2018-10-02 | printk: CON_PRINTBUFFER console registration is a bit racy | Sergey Senozhatsky | 1 | -1/+5
CON_PRINTBUFFER console registration requires us to do several preparation steps:
- Rollback console_seq to replay logbuf messages which were already seen on other consoles;
- Set the exclusive_console flag so console_unlock() will ->write() logbuf messages only to the exclusive_console driver.

The way we do it, however, is a bit racy:

  logbuf_lock_irqsave(flags);
  console_seq = syslog_seq;
  console_idx = syslog_idx;
  logbuf_unlock_irqrestore(flags);
                                     << preemption enabled
                                     << irqs enabled
  exclusive_console = newcon;
  console_unlock();

We rollback console_seq under logbuf_lock with IRQs disabled, but we set exclusive_console with local IRQs enabled and logbuf unlocked. If the system oops-es or panic-s before we set exclusive_console - and given that we have IRQs and preemption enabled there is such a possibility - we will replay all logbuf messages to every registered console, which may be a bit annoying and time consuming.

Move the exclusive_console assignment to the same IRQs-disabled and logbuf_lock-protected section where we rollback console_seq.

Link: http://lkml.kernel.org/r/[email protected] To: Steven Rostedt <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: [email protected] Signed-off-by: Sergey Senozhatsky <[email protected]> Signed-off-by: Petr Mladek <[email protected]>
2018-10-02 | printk: Do not miss new messages when replaying the log | Petr Mladek | 1 | -4/+9
The variable "exclusive_console" is used to replay all existing messages on a newly registered console. It is cleared when all messages are out. The problem is that new messages might appear in the meantime; these are then visible only on the exclusive console. The obvious solution is to clear "exclusive_console" after we replay all messages that had already been processed before we started the replay. Reported-by: Sergey Senozhatsky <[email protected]> Link: http://lkml.kernel.org/r/[email protected] To: Steven Rostedt <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: [email protected] Acked-by: Sergey Senozhatsky <[email protected]> Signed-off-by: Petr Mladek <[email protected]>
2018-10-02 | bpf: don't accept cgroup local storage with zero value size | Roman Gushchin | 1 | -0/+3
Explicitly forbid creating cgroup local storage maps with zero value size, as it makes no sense and might even cause a panic. Reported-by: [email protected] Signed-off-by: Roman Gushchin <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Daniel Borkmann <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
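The fix is presumably a short guard in the map-allocation path, along these lines (a sketch):

    static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr)
    {
        /* a zero-sized value makes no sense and could lead to a panic */
        if (attr->value_size == 0)
            return ERR_PTR(-EINVAL);
        /* ... normal allocation continues ... */
    }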
2018-10-02 | sched/numa: Migrate pages to local nodes quicker early in the lifetime of a task | Mel Gorman | 1 | -1/+11
Automatic NUMA Balancing uses a multi-stage pass to decide whether a page should migrate to a local node. This filter avoids excessive ping-ponging if a page is shared or used by threads that migrate cross-node frequently.

Threads inherit both page tables and the preferred node ID from the parent. This means that threads can trigger hinting faults earlier than a new task, which delays scanning for a number of seconds. As a thread can be load balanced very early in its lifetime, there can be an unnecessary delay before it starts migrating thread-local data.

This patch migrates private pages faster early in the lifetime of a thread, using the sequence counter as an identifier of new tasks. With this patch applied, STREAM performance is the same as 4.17 even though processes are not spread cross-node prematurely. Other workloads showed a mix of minor gains and losses. This is somewhat expected, as most workloads are not very sensitive to the starting conditions of a process.

                           4.19.0-rc5            4.19.0-rc5                4.17.0
                           numab-v1r1      fastmigrate-v1r1               vanilla
  MB/sec copy     43298.52 (  0.00%)    47335.46 (  9.32%)    47219.24 (  9.06%)
  MB/sec scale    30115.06 (  0.00%)    32568.12 (  8.15%)    32527.56 (  8.01%)
  MB/sec add      32825.12 (  0.00%)    36078.94 (  9.91%)    35928.02 (  9.45%)
  MB/sec triad    32549.52 (  0.00%)    35935.94 ( 10.40%)    35969.88 ( 10.51%)

Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Jirka Hladky <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Linux-MM <[email protected]> Cc: Srikar Dronamraju <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-10-02 | Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu | Ingo Molnar | 14 | -2387/+2008
Pull v4.20 RCU changes from Paul E. McKenney:

- Documentation updates, including some good-eye catches from Joel Fernandes.

- SRCU updates, most notably changes enabling call_srcu() to be invoked very early in the boot sequence.

- Torture-test updates, including some preliminary work towards making rcutorture better able to find problems that result in insufficient grace-period forward progress.

- Consolidate the RCU-bh, RCU-preempt, and RCU-sched flavors into a single flavor similar to RCU-sched in !PREEMPT kernels and into a single flavor similar to RCU-preempt (but also waiting on preempt-disabled sequences of code) in PREEMPT kernels. This branch also includes a refactoring of rcu_{nmi,irq}_{enter,exit}() from Byungchul Park.

- Now that there is only one RCU flavor in any given running kernel, the many "rsp" pointers are no longer required, and this cleanup series removes them.

- This branch carries out additional cleanups made possible by the RCU flavor consolidation, including inlining now-trivial functions, updating comments and definitions, and removing now-unneeded rcutorture scenarios.

- Initial changes to RCU to better promote forward progress of grace periods, including fixing a bug found by Marius Hillenbrand and David Woodhouse, with the fix suggested by Peter Zijlstra.

- Now that there is only one flavor of RCU in any running kernel, there is also only one rcu_data structure per CPU. This means that the rcu_dynticks structure can be merged into the rcu_data structure, a task taken on by this branch. This branch also contains a -rt-related fix from Mike Galbraith.

Signed-off-by: Ingo Molnar <[email protected]>
2018-10-02 | Merge branch 'perf/urgent' into perf/core, to pick up fixes | Ingo Molnar | 3 | -27/+84
Signed-off-by: Ingo Molnar <[email protected]>
2018-10-02 | sched/fair: Remove setting task's se->runnable_weight during PELT update | Dietmar Eggemann | 2 | -6/+2
A CFS (SCHED_OTHER, SCHED_BATCH or SCHED_IDLE policy) task's se->runnable_weight must always be in sync with its se->load.weight. se->runnable_weight is set to se->load.weight when the task is forked (init_entity_runnable_average()) or reniced (reweight_entity()).

There are two cases in set_load_weight() which, since they currently only set se->load.weight, could lead to a situation in which se->load.weight differs from se->runnable_weight for a CFS task:

(1) A task switches to SCHED_IDLE.

(2) A SCHED_FIFO, SCHED_RR or SCHED_DEADLINE task which has been reniced (during which only its static priority gets set) switches to SCHED_OTHER or SCHED_BATCH.

Set se->runnable_weight to se->load.weight in these two cases to prevent this. This eliminates the need to explicitly set it to se->load.weight during PELT updates in the CFS scheduler fastpath.

Signed-off-by: Dietmar Eggemann <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Morten Rasmussen <[email protected]> Cc: Patrick Bellasi <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Quentin Perret <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vincent Guittot <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
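A minimal sketch of the fix in set_load_weight(), showing the SCHED_IDLE branch (constants are the scheduler's existing weight tables; not the verbatim diff):

    static void set_load_weight(struct task_struct *p, bool update_load)
    {
        int prio = p->static_prio - MAX_RT_PRIO;
        struct load_weight *load = &p->se.load;

        if (idle_policy(p->policy)) {
            load->weight = scale_load(WEIGHT_IDLEPRIO);
            load->inv_weight = WMULT_IDLEPRIO;
            p->se.runnable_weight = load->weight;  /* keep in sync */
            return;
        }
        /* ... the normal-priority path sets runnable_weight the same way ... */
    }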