path: root/kernel
Age | Commit message | Author | Files/Lines
2018-01-14 | bpf: offload: add map offload infrastructure | Jakub Kicinski | 3 files, -13/+226
BPF map offload follows a similar path to program offload. At creation time users may specify the ifindex of the device on which they want to create the map. The map will be validated by the kernel's .map_alloc_check callback and the device driver will be called for the actual allocation. The map will have an empty set of operations associated with it (save for alloc and free callbacks). The real device callbacks are kept in map->offload->dev_ops because they have slightly different signatures. Map operations are called in process context so the driver may communicate with HW freely, msleep(), wait() etc.

Map alloc and free callbacks are muxed via the existing .ndo_bpf, and are always called with the rtnl lock held. Maps and programs are guaranteed to be destroyed before .ndo_uninit (i.e. before unregister_netdev() returns). Map callbacks are invoked with bpf_devs_lock *read* locked; drivers must take care of exclusive locking if necessary.

All offload-specific branches are marked with unlikely() (through bpf_map_is_dev_bound()), given that the branch penalty will be negligible compared to IO anyway, and we don't want to penalize the SW path unnecessarily.

Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

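The split ops table described above plausibly looks like this; a sketch of the idea, with the exact callback set treated as an assumption rather than verified kernel source:

	/* Device callbacks for an offloaded map live in their own ops table,
	 * since their signatures differ from the regular bpf_map_ops. */
	struct bpf_map_dev_ops {
		int (*map_get_next_key)(struct bpf_offloaded_map *map,
					void *key, void *next_key);
		int (*map_lookup_elem)(struct bpf_offloaded_map *map,
				       void *key, void *value);
		int (*map_update_elem)(struct bpf_offloaded_map *map,
				       void *key, void *value, u64 flags);
		int (*map_delete_elem)(struct bpf_offloaded_map *map, void *key);
	};

Because these run in process context, a driver implementation is free to sleep while talking to the hardware.
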
2018-01-14 | bpf: offload: factor out netdev checking at allocation time | Jakub Kicinski | 1 file, -8/+20
Add a helper to check if a netdev could be found and whether it has the .ndo_bpf callback. There is no need to check the callback every time it's invoked; ndos can't reasonably be swapped for a set without .ndo_bpf while a program is loaded. bpf_dev_offload_check() will also be used by map offload.

Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

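A plausible shape for the helper (a sketch; the exact error codes are assumptions):

	static int bpf_dev_offload_check(struct net_device *netdev)
	{
		if (!netdev)
			return -EINVAL;
		if (!netdev->netdev_ops->ndo_bpf)
			return -EOPNOTSUPP;
		return 0;
	}
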
2018-01-14 | bpf: rename bpf_dev_offload -> bpf_prog_offload | Jakub Kicinski | 1 file, -5/+5
With map offload coming, we need to call the program offload structure something less ambiguous. Pure rename, no functional changes.

Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

2018-01-14 | bpf: add helper for copying attrs to struct bpf_map | Jakub Kicinski | 7 files, -40/+16
All map types reimplement the field-by-field copy of union bpf_attr members into struct bpf_map. Add a helper to perform this operation.

Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

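The helper plausibly reduces to one shared field-by-field copy (a sketch; treat the exact field list as an assumption):

	void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr)
	{
		map->map_type = attr->map_type;
		map->key_size = attr->key_size;
		map->value_size = attr->value_size;
		map->max_entries = attr->max_entries;
		map->map_flags = attr->map_flags;
		map->numa_node = bpf_map_attr_numa_node(attr);
	}
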
2018-01-14 | bpf: hashtab: move checks out of alloc function | Jakub Kicinski | 1 file, -16/+39
Use the new callback to perform allocation checks for hash maps.

Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

2018-01-14 | bpf: hashtab: move attribute validation before allocation | Jakub Kicinski | 1 file, -24/+23
A number of attribute checks are currently performed after the hashtab is already allocated. Move them up so they can later be split out into the check function. The checks now have to be performed on the attr union directly instead of on the members of bpf_map, since bpf_map will be allocated later. No functional changes.

Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

2018-01-14 | bpf: add map_alloc_check callback | Jakub Kicinski | 1 file, -4/+13
.map_alloc callbacks contain a number of checks validating user-provided map attributes against the constraints of a particular map type. For offloaded maps we will need to check map attributes without actually allocating any memory on the host. Add a new callback for validating attributes before any memory is allocated. This callback can be selectively implemented by map types to share code with offloads, or simply to separate the logical steps of validation and allocation.

Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

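For a map type that opts in, the ops table gains the new hook alongside .map_alloc; a sketch using hashtab names for illustration:

	const struct bpf_map_ops htab_map_ops = {
		.map_alloc_check = htab_map_alloc_check, /* validate attrs, allocate nothing */
		.map_alloc = htab_map_alloc,             /* allocate once checks have passed */
		.map_free = htab_map_free,
		/* ... lookup/update/delete ops ... */
	};
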
2018-01-12 | error-injection: Support fault injection framework | Masami Hiramatsu | 2 files, -0/+350
Support the in-kernel fault-injection framework via debugfs. This allows you to inject a conditional error into a specified function using debugfs interfaces.

Here is the result of the test script described in Documentation/fault-injection/fault-injection.txt

  ===========
  # ./test_fail_function.sh
  1+0 records in
  1+0 records out
  1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0227404 s, 46.1 MB/s
  btrfs-progs v4.4
  See http://btrfs.wiki.kernel.org for more information.

  Label:              (null)
  UUID:               bfa96010-12e9-4360-aed0-42eec7af5798
  Node size:          16384
  Sector size:        4096
  Filesystem size:    1001.00MiB
  Block group profiles:
    Data:             single            8.00MiB
    Metadata:         DUP              58.00MiB
    System:           DUP              12.00MiB
  SSD detected:       no
  Incompat features:  extref, skinny-metadata
  Number of devices:  1
  Devices:
     ID        SIZE  PATH
      1  1001.00MiB  /dev/loop2

  mount: mount /dev/loop2 on /opt/tmpmnt failed: Cannot allocate memory
  SUCCESS!
  ===========

Signed-off-by: Masami Hiramatsu <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

2018-01-12 | error-injection: Separate error-injection from kprobe | Masami Hiramatsu | 5 files, -171/+9
The error-injection framework is not limited to kprobes or bpf; other kernel subsystems can use it freely for checking the safety of error injection, e.g. livepatch, ftrace etc. So separate the error-injection framework from kprobes.

Some differences have been made:

- the "kprobe" word is removed from all APIs/structures.
- BPF_ALLOW_ERROR_INJECTION() is renamed to ALLOW_ERROR_INJECTION() since it is no longer limited to BPF.
- CONFIG_FUNCTION_ERROR_INJECTION is the config item of this feature. It is automatically enabled if the arch supports the error injection feature for kprobe or ftrace etc.

Signed-off-by: Masami Hiramatsu <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

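Whitelisting a function for error injection then looks roughly like this (a sketch; my_fallible_func is hypothetical, and the two-argument form with an error type such as ERRNO is the macro's present-day signature, which may postdate this particular commit):

	#include <linux/error-injection.h>

	static noinline int my_fallible_func(void)	/* hypothetical target */
	{
		return 0;	/* may be overridden to return an injected -errno */
	}
	/* Record the function in the error-injection whitelist; ERRNO declares
	 * that injected failures are expressed as negative error codes. */
	ALLOW_ERROR_INJECTION(my_fallible_func, ERRNO);
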
2018-01-12 | tracing/kprobe: bpf: Compare instruction pointer with original one | Masami Hiramatsu | 2 files, -15/+7
Compare the instruction pointer with the original one on the stack instead of using the per-cpu bpf_kprobe_override flag.

This patch also consolidates the reset_current_kprobe() and preempt_enable_no_resched() blocks. Those can be done in one place.

Signed-off-by: Masami Hiramatsu <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

2018-01-12 | tracing/kprobe: bpf: Check error injectable event is on function entry | Masami Hiramatsu | 4 files, -15/+16
Check whether the error injectable event is on function entry or not. Currently it checks whether the event is an ftrace-based kprobe or not, but that is wrong. It should check if the event is on the entry of the target function. Since error injection overrides a function to just return with a modified return value, that operation must be done before the target function starts building its stack frame.

As a side effect, bpf error injection no longer needs to depend on the function tracer; it can work with sw-breakpoint based kprobe events too.

Signed-off-by: Masami Hiramatsu <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

2018-01-11 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net | David S. Miller | 4 files, -17/+114
BPF alignment tests got a conflict because the registers are output as Rn_w instead of just Rn in net-next, and in net a fixup for a testcase prohibits logical operations on pointers before using them.

Also, we should not attempt to patch BPF call args if JIT always-on is enabled. Instead, if we fail to JIT the subprogs we should pass an error back up and fail immediately.

Signed-off-by: David S. Miller <[email protected]>

2018-01-11 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next | David S. Miller | 1 file, -4/+16
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-01-11

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Various BPF related improvements and fixes to the nfp driver: i) do not register the XDP RXQ structure to control queues, ii) round up program stack size to word size for nfp, iii) restrict MTU changes when BPF offload is active, iv) add more fully featured relocation support to the JIT, v) add support for signed compare instructions to the nfp JIT, vi) export and reuse the verifier log routine for nfp, and many more, from Jakub, Quentin and Nic.

2) Fix a syzkaller reported GPF in BPF's copy_verifier_state() when we hit the kmalloc failure path, from Alexei.

3) Add two follow-up fixes for the recent XDP RXQ series: i) kvzalloc() allocated memory was only kfree()'ed, and ii) fix a memory leak where the RX queue was not freed in netif_free_rx_queues(), from Jakub.

4) Add a sample for transferring XDP meta data into the skb, here it is used for setting skb->mark with the buffer from XDP, from Jesper.
====================

Signed-off-by: David S. Miller <[email protected]>

2018-01-10 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf | David S. Miller | 4 files, -13/+100
Daniel Borkmann says:

====================
pull-request: bpf 2018-01-09

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Prevent out-of-bounds speculation in BPF maps by masking the index after bounds checks in order to fix spectre v1, and add an option BPF_JIT_ALWAYS_ON into Kconfig that allows for removing the BPF interpreter from the kernel in favor of JIT-only mode to make spectre v2 harder, from Alexei.

2) Remove false sharing of map refcount with max_entries which was used in spectre v1, from Daniel.

3) Add a missing NULL psock check in sockmap in order to fix a race, from John.

4) Fix test_align BPF selftest case since a recent change in verifier rejects the bit-wise arithmetic on pointers earlier but test_align update was missing, from Alexei.
====================

Signed-off-by: David S. Miller <[email protected]>

2018-01-10 | bpf: export function to write into verifier log buffer | Quentin Monnet | 1 file, -4/+12
Rename the BPF verifier `verbose()` to `bpf_verifier_log_write()` and export it, so that other components (in particular, drivers for BPF offload) can reuse the user buffer log to dump error messages at verification time.

Renaming `verbose()` was necessary in order to avoid exporting a name that generic to the global namespace. However, to prevent too much pain for backports, the calls to `verbose()` in the kernel BPF verifier were not changed. Instead, use function aliasing to make `verbose` point to `bpf_verifier_log_write`. Another solution could consist in making a wrapper around `verbose()`, but since it is a variadic function, I don't see a clean way without creating two identical wrappers, one for the verifier and one to export.

Signed-off-by: Quentin Monnet <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

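The aliasing trick plausibly looks like this (a sketch of GCC symbol aliasing; the exact attribute placement in the verifier may differ):

	__printf(2, 3) void bpf_verifier_log_write(struct bpf_verifier_env *env,
						   const char *fmt, ...)
	{
		/* format the message into the user-supplied log buffer ... */
	}
	EXPORT_SYMBOL_GPL(bpf_verifier_log_write);

	/* Internal call sites stay untouched: verbose() is an alias, not a
	 * wrapper, so no variadic-argument forwarding is needed. */
	__printf(2, 3) static void verbose(struct bpf_verifier_env *env,
					   const char *fmt, ...)
		__attribute__((alias("bpf_verifier_log_write")));
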
2018-01-09 | bpf: introduce BPF_JIT_ALWAYS_ON config | Alexei Starovoitov | 1 file, -0/+19
The BPF interpreter has been used as part of the spectre 2 attack CVE-2017-5715.

A quote from the Google Project Zero blog:
"At this point, it would normally be necessary to locate gadgets in the host kernel code that can be used to actually leak data by reading from an attacker-controlled location, shifting and masking the result appropriately and then using the result of that as offset to an attacker-controlled address for a load. But piecing gadgets together and figuring out which ones work in a speculation context seems annoying. So instead, we decided to use the eBPF interpreter, which is built into the host kernel - while there is no legitimate way to invoke it from inside a VM, the presence of the code in the host kernel's text section is sufficient to make it usable for the attack, just like with ordinary ROP gadgets."

To make the attacker's job harder, introduce the BPF_JIT_ALWAYS_ON config option that removes the interpreter from the kernel in favor of JIT-only mode.

So far eBPF JIT is supported by: x64, arm64, arm32, sparc64, s390, powerpc64, mips64

The start of the JITed program is randomized and the code page is marked as read-only. In addition "constant blinding" can be turned on with net.core.bpf_jit_harden

v2->v3:
- move __bpf_prog_ret0 under ifdef (Daniel)

v1->v2:
- fix init order, test_bpf and cBPF (Daniel's feedback)
- fix offloaded bpf (Jakub's feedback)
- add 'return 0' dummy in case something can invoke prog->bpf_func
- retarget bpf tree. For bpf-next the patch would need one extra hunk. It will be sent when the trees are merged back to net-next

Considered doing:
  int bpf_jit_enable __read_mostly = BPF_EBPF_JIT_DEFAULT;
but it seems better to land the patch as-is and in bpf-next remove the bpf_jit_enable global variable from all JITs, consolidate it in one place and remove this jit_init() function.

Signed-off-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

2018-01-09 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net | David S. Miller | 18 files, -57/+200
2018-01-09 | bpf: prevent out-of-bounds speculation | Alexei Starovoitov | 2 files, -11/+72
Under speculation, CPUs may mis-predict branches in bounds checks. Thus, memory accesses under a bounds check may be speculated even if the bounds check fails, providing a primitive for building a side channel.

To avoid leaking kernel data, round up array-based maps and mask the index after the bounds check, so a speculated load with an out-of-bounds index will load either a valid value from the array or zero from the padded area. Unconditionally mask the index for all array types even when max_entries is not rounded to a power of 2 for the root user. When the map is created by an unprivileged user, generate a sequence of bpf insns that includes an AND operation to make sure that the JITed code includes the same 'index & index_mask' operation.

If a prog_array map is created by an unprivileged user, replace
  bpf_tail_call(ctx, map, index);
with
  if (index >= max_entries) {
    index &= map->index_mask;
    bpf_tail_call(ctx, map, index);
  }
(along with roundup to power of 2) to prevent out-of-bounds speculation. There is a secondary redundant 'if (index >= max_entries)' in the interpreter and in all JITs, but they can be optimized later if necessary.

Other array-like maps (cpumap, devmap, sockmap, perf_event_array, cgroup_array) cannot be used by unpriv, so no changes there.

That fixes the bpf side of "Variant 1: bounds check bypass (CVE-2017-5753)" on all architectures with and without JIT.

v2->v3:
Daniel noticed that the attack can potentially be crafted via syscall commands without loading a program, so add masking to those paths as well.

Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: John Fastabend <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

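In the array map lookup, the masking reduces to a single AND against the rounded-up size; a sketch (index_mask being the power-of-two size minus one):

	static void *array_map_lookup_elem(struct bpf_map *map, void *key)
	{
		struct bpf_array *array = container_of(map, struct bpf_array, map);
		u32 index = *(u32 *)key;

		if (unlikely(index >= array->map.max_entries))
			return NULL;

		/* Even if the branch above is mis-predicted, the AND keeps a
		 * speculative load inside the (rounded up) array. */
		return array->value + array->elem_size * (index & array->index_mask);
	}
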
2018-01-08 | Merge branch 'for-4.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup | Linus Torvalds | 2 files, -12/+14
Pull cgroup fixes from Tejun Heo:
"This contains fixes for the following two non-trivial issues:

 - The task iterator got broken while adding thread mode support for v4.14. It was less visible because it only triggers when both cgroup1 and cgroup2 hierarchies are in use. Recent versions of systemd use cgroup2 for process management even when cgroup1 is used for resource control, exposing this issue.

 - The cpuset CPU hotplug path could deadlock when racing against exits.

There are also two patches to replace unlimited strcpy() usages with strlcpy()"

* 'for-4.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: fix css_task_iter crash on CSS_TASK_ITER_PROC
  cgroup: Fix deadlock in cpu hotplug path
  cgroup: use strlcpy() instead of strscpy() to avoid spurious warning
  cgroup: avoid copying strings longer than the buffers

2018-01-08 | bpf: fix verifier GPF in kmalloc failure path | Alexei Starovoitov | 1 file, -0/+4
syzbot reported the following panic in the verifier triggered by kmalloc error injection:

  kasan: GPF could be caused by NULL-ptr deref or user memory access
  RIP: 0010:copy_func_state kernel/bpf/verifier.c:403 [inline]
  RIP: 0010:copy_verifier_state+0x364/0x590 kernel/bpf/verifier.c:431
  Call Trace:
   pop_stack+0x8c/0x270 kernel/bpf/verifier.c:449
   push_stack kernel/bpf/verifier.c:491 [inline]
   check_cond_jmp_op kernel/bpf/verifier.c:3598 [inline]
   do_check+0x4b60/0xa050 kernel/bpf/verifier.c:4731
   bpf_check+0x3296/0x58c0 kernel/bpf/verifier.c:5489
   bpf_prog_load+0xa2a/0x1b00 kernel/bpf/syscall.c:1198
   SYSC_bpf kernel/bpf/syscall.c:1807 [inline]
   SyS_bpf+0x1044/0x4420 kernel/bpf/syscall.c:1769

When copy_verifier_state() aborts in the middle due to kmalloc failure, some of the frames could have been partially copied, while the current free_verifier_state() loop
  for (i = 0; i <= state->curframe; i++)
assumed that all frames are non-null. Simply fix it by adding 'if (!state)' to free_func_state(). Also, to avoid stressing the frame copy logic further, if kzalloc fails in push_stack() free env->cur_state right away.

Fixes: f4d7e40a5b71 ("bpf: introduce function calls (verification)")
Reported-by: [email protected]
Reported-by: [email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

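The guard itself is a one-liner at the top of the frame-freeing helper; a sketch consistent with the description above:

	static void free_func_state(struct bpf_func_state *state)
	{
		if (!state)
			return;	/* frame was never copied: kmalloc failed mid-copy */
		kfree(state->stack);
		kfree(state);
	}
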
2018-01-06 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs | Linus Torvalds | 2 files, -2/+40
Pull vfs fixes from Al Viro:

 - untangle sys_close() abuses in xt_bpf

 - deal with register_shrinker() failures in sget()

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fix "netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'"
  sget(): handle failures of register_shrinker()
  mm,vmscan: Make unregister_shrinker() no-op if register_shrinker() failed.

2018-01-07 | bpf: sockmap missing NULL psock check | John Fastabend | 1 file, -2/+9
Add a psock NULL check to handle a racing sock event that can grab the sk_callback_lock before this case but after the xchg happens, causing the refcnt to hit zero and the sock user data (psock) to be null and queued for garbage collection.

Also add a comment in the code because this is a bit subtle and not obvious in my opinion.

Signed-off-by: John Fastabend <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

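Inside the sock-state callback, the check plausibly looks like this (a sketch; the surrounding function is abbreviated):

	write_lock_bh(&sk->sk_callback_lock);
	psock = smap_psock_sk(sk);
	if (unlikely(!psock)) {
		/* A racing sock event already detached the psock after the
		 * xchg and queued it for garbage collection: nothing to do. */
		write_unlock_bh(&sk->sk_callback_lock);
		return;
	}
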
2018-01-06 | bpf: implement syscall command BPF_MAP_GET_NEXT_KEY for stacktrace map | Yonghong Song | 1 file, -2/+26
Currently, the bpf syscall command BPF_MAP_GET_NEXT_KEY is not supported for the stacktrace map. However, there are use cases where user space wants to enumerate all stacktrace map entries, and there the BPF_MAP_GET_NEXT_KEY command will be really helpful. In addition, if user space wants to delete all map entries in order to save memory and does not want to close the map file descriptor, BPF_MAP_GET_NEXT_KEY may help improve performance if map entries are sparsely populated.

The implementation behaves similarly to the BPF_MAP_GET_NEXT_KEY implementation in hashtab. If the user provides a NULL key pointer or an invalid key, the first key is returned. Otherwise, the first valid key after the input parameter "key" is returned, or -ENOENT if no valid key can be found.

Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

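A sketch of those semantics for the stack map (the bucket layout details are assumptions):

	static int stack_map_get_next_key(struct bpf_map *map, void *key,
					  void *next_key)
	{
		struct bpf_stack_map *smap =
			container_of(map, struct bpf_stack_map, map);
		u32 id;

		if (!key) {
			id = 0;		/* NULL key: return the first key */
		} else {
			id = *(u32 *)key;
			if (id >= smap->n_buckets || !smap->buckets[id])
				id = 0;	/* invalid key: also start over */
			else
				id++;
		}
		for (; id < smap->n_buckets; id++)
			if (smap->buckets[id]) {
				*(u32 *)next_key = id;
				return 0;
			}
		return -ENOENT;
	}
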
2018-01-05fix "netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'"Al Viro2-2/+40
The descriptor table is a shared object; it's not a place where you can stick temporary references to files, especially when we don't need an opened file at all.

Cc: [email protected] # v4.14
Fixes: 98589a0998b8 ("netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'")
Signed-off-by: Al Viro <[email protected]>

2018-01-04 | kernel/exit.c: export abort() to modules | Andrew Morton | 1 file, -0/+1
gcc -fisolate-erroneous-paths-dereference can generate calls to abort() from modular code too.

[[email protected]: drop duplicate exports of abort()]
Link: http://lkml.kernel.org/r/[email protected]
Reported-by: Vineet Gupta <[email protected]>
Cc: Sudip Mukherjee <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Alexey Brodkin <[email protected]>
Cc: Russell King <[email protected]>
Cc: Jose Abreu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

2018-01-04 | kernel/acct.c: fix the acct->needcheck check in check_free_space() | Oleg Nesterov | 1 file, -1/+1
As Tsukada explains, the time_is_before_jiffies(acct->needcheck) check is very wrong; we need time_is_after_jiffies() to make sys_acct() work.

Ignoring the overflows, the code should "goto out" if needcheck > jiffies, while currently it checks "needcheck < jiffies" and thus in the likely case check_free_space() does nothing until jiffies overflow.

In particular this means that sys_acct() is simply broken: acct_on() sets acct->needcheck = jiffies and expects that check_free_space() should set acct->active = 1 after the free-space check, but this won't happen if jiffies increments in between.

This was broken by commit 32dc73086015 ("get rid of timer in kern/acct.c") in 2011, then another (correct) commit 795a2f22a8ea ("acct() should honour the limits from the very beginning") made the problem more visible.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 32dc73086015 ("get rid of timer in kern/acct.c")
Reported-by: TSUKADA Koutaro <[email protected]>
Suggested-by: TSUKADA Koutaro <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Al Viro <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

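The fix is the flipped comparison in check_free_space(); a sketch of the changed lines:

	/* Bail out while the next check is still in the future; the old
	 * code had this comparison inverted. */
	if (time_is_after_jiffies(acct->needcheck))	/* was: time_is_before_jiffies */
		goto out;
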
2018-01-04 | bpf: only build sockmap with CONFIG_INET | John Fastabend | 1 file, -0/+2
The sockmap infrastructure is only aware of TCP sockets at the moment. In the future we plan to add UDP. In both cases CONFIG_INET is required. So let's only build sockmap if CONFIG_INET is enabled.

Signed-off-by: John Fastabend <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

2018-01-04 | bpf: sockmap remove unused function | John Fastabend | 1 file, -8/+0
This was added for some work that was eventually factored out, but the helper call was missed. Remove it now and add it back later if needed.

Signed-off-by: John Fastabend <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

2018-01-03 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace | Linus Torvalds | 1 file, -3/+5
Pull pid allocation bug fix from Eric Biederman:
"The replacement of the pid hash table and the pid bitmap with an idr resulted in an implementation that now fails more often in low memory situations, allowing fuzzers to observe bad behavior from a memory allocation failure during pid allocation.

This is a small change to fix this by making the kernel more robust in the case of error. The non-error paths are left alone so the only danger is to the already broken error path. I have manually injected errors and verified that this new error handling works"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  pid: Handle failure to allocate the first pid in a pid namespace

2017-12-31 | Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 3 files, -15/+43
Pull timer fixes from Thomas Gleixner:
"A pile of fixes for long standing issues with the timer wheel and the NOHZ code:

 - Prevent timer base confusion across the nohz switch, which can cause unlocked access and data corruption

 - Reinitialize the stale base clock on cpu hotplug to prevent subtle side effects including rollovers on 32bit

 - Prevent an interrupt storm when the timer softirq is already pending, caused by tick_nohz_stop_sched_tick()

 - Move the timer start tracepoint to a place where it actually makes sense

 - Add documentation to the timerqueue functions as they have caused confusion several times now"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timerqueue: Document return values of timerqueue_add/del()
  timers: Invoke timer_start_debug() where it makes sense
  nohz: Prevent a timer interrupt storm in tick_nohz_stop_sched_tick()
  timers: Reinitialize per cpu bases on hotplug
  timers: Use deferrable base independent of base::nohz_active

2017-12-31 | Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 1 file, -4/+4
Pull smp fixlet from Thomas Gleixner:
"A trivial build warning fix for newer compilers"

* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu/hotplug: Move inline keyword at the beginning of declaration

2017-12-31 | Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 1 file, -0/+1
Pull scheduler fixes from Thomas Gleixner:
"Three patches addressing the fallout of the CPU_ISOLATION changes, especially with NO_HZ_FULL, plus documentation of the boot parameter dependency"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/isolation: Document boot parameters dependency on CONFIG_CPU_ISOLATION=y
  sched/isolation: Enable CONFIG_CPU_ISOLATION=y by default
  sched/isolation: Make CONFIG_NO_HZ_FULL select CONFIG_CPU_ISOLATION

2017-12-31 | Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 6 files, -19/+77
Pull irq fixes from Thomas Gleixner:
"A rather large update after the kaisered maintainer finally found time to handle regression reports.

 - The larger part addresses a regression caused by the x86 vector management rework. The reservation based model does not work reliably for MSI interrupts if they cannot be masked (yes, yet another hw engineering trainwreck).

   The reason is that the reservation mode assigns a dummy vector when the interrupt is allocated and switches to a real vector when the interrupt is requested. If the MSI entry cannot be masked then the initialization might raise an interrupt before the interrupt is requested, which ends up as a spurious interrupt and causes device malfunction and worse. The fix is to exclude MSI interrupts which do not support masking from reservation mode and assign a real vector right away.

 - Extend the extra lockdep class setup for nested interrupts with a class for the recently added irq_desc::request_mutex so lockdep can differentiate and does not emit false positive warnings.

 - A ratelimit guard for the bad irq printout so in case a bad irq comes back immediately the system does not drown in dmesg spam"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq/msi, x86/vector: Prevent reservation mode for non maskable MSI
  genirq/irqdomain: Rename early argument of irq_domain_activate_irq()
  x86/vector: Use IRQD_CAN_RESERVE flag
  genirq: Introduce IRQD_CAN_RESERVE flag
  genirq/msi: Handle reactivation only on success
  gpio: brcmstb: Make really use of the new lockdep class
  genirq: Guard handle_bad_irq log messages
  kernel/irq: Extend lockdep class for request mutex

2017-12-31 | bpf: offload: report device information for offloaded programs | Jakub Kicinski | 2 files, -0/+65
Report the ifindex and namespace information of offloaded programs to the user. If the device has disappeared return -ENODEV. Specify the namespace using the dev/inode combination.

CC: Eric W. Biederman <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

2017-12-31 | bpf: offload: free program id when device disappears | Jakub Kicinski | 2 files, -2/+10
Bound programs are quite useless after their device disappears. They are simply waiting for their reference count to go to zero, so don't list them in BPF_PROG_GET_NEXT_ID: free their ID early. Note that orphaned offload programs will return -ENODEV on BPF_OBJ_GET_INFO_BY_FD so the user will never see ID 0.

Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

2017-12-31 | bpf: offload: free prog->aux->offload when device disappears | Jakub Kicinski | 1 file, -14/+9
All bpf offload operations should now be under bpf_devs_lock, so it's safe to free and clear the entire offload structure, not only the netdev pointer. __bpf_prog_offload_destroy() will no longer be called multiple times.

Suggested-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

2017-12-31 | bpf: offload: allow netdev to disappear while verifier is running | Jakub Kicinski | 2 files, -27/+23
To allow verifier instruction callbacks without any extra locking, the NETDEV_UNREGISTER notification would wait on a waitqueue for the verifier to finish. This design decision was made when the rtnl lock was providing all the locking. Use the read/write lock instead and remove the waitqueue.

The verifier will now call into the offload code, so dev_ops are moved to the offload structure. Since verifier calls are all under bpf_prog_is_dev_bound() we no longer need static inline implementations to please builds with CONFIG_NET=n.

Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

2017-12-31 | bpf: offload: don't use prog->aux->offload as boolean | Jakub Kicinski | 1 file, -1/+3
We currently use aux->offload to indicate that a program is bound to a specific device. This forces us to keep the offload structure around even after the device is gone. Add a bool member to struct bpf_prog_aux to indicate if offload was requested.

Suggested-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

2017-12-31 | bpf: offload: don't require rtnl for dev list manipulation | Jakub Kicinski | 1 file, -10/+24
We don't need the RTNL lock for all operations on offload state. We only need to hold it around ndo calls. Device offload initialization doesn't require it. The soon-to-come querying of offload info will only need it partially. We will also be able to remove the waitqueue in following patches.

Use a struct rw_semaphore because map offload will require sleeping with the semaphore held for read.

Suggested-by: Kirill Tkhai <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

2017-12-29 | timers: Invoke timer_start_debug() where it makes sense | Thomas Gleixner | 1 file, -2/+2
The timer start debug function is called before the proper timer base is set. As a consequence the trace data contains the stale CPU and flags values.

Call the debug function after setting the new base and flags.

Fixes: 500462a9de65 ("timers: Switch to a non-cascading wheel")
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Sebastian Siewior <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Paul McKenney <[email protected]>
Cc: Anna-Maria Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

2017-12-29 | nohz: Prevent a timer interrupt storm in tick_nohz_stop_sched_tick() | Thomas Gleixner | 1 file, -2/+17
The conditions in irq_exit() to invoke tick_nohz_irq_exit(), which subsequently invokes tick_nohz_stop_sched_tick(), are:

  if ((idle_cpu(cpu) && !need_resched()) || tick_nohz_full_cpu(cpu))

If need_resched() is not set, but a timer softirq is pending, then this is an indication that the softirq code punted and delegated the execution to ksoftirqd. need_resched() is not true because the current interrupted task takes precedence over ksoftirqd.

Invoking tick_nohz_irq_exit() in this case can cause an endless loop of timer interrupts because the timer wheel contains an expired timer, but softirqs are not yet executed. So it returns an immediate expiry request, which causes the timer to fire immediately again. Lather, rinse and repeat....

Prevent that by adding a check for a pending timer soft interrupt to the conditions in tick_nohz_stop_sched_tick() which avoid calling get_next_timer_interrupt(). That keeps the tick sched timer on the tick and prevents a repetitive programming of an already expired timer.

Reported-by: Sebastian Siewior <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Frederic Weisbecker <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: Anna-Maria Gleixner <[email protected]>
Cc: Sebastian Siewior <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712272156050.2431@nanos

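A sketch of such a pending-timer-softirq check (the helper name is illustrative; tick_nohz_stop_sched_tick() would consult it before calling get_next_timer_interrupt()):

	static inline bool local_timer_softirq_pending(void)
	{
		/* TIMER_SOFTIRQ still pending means the wheel holds an expired
		 * timer, so get_next_timer_interrupt() would keep requesting
		 * immediate expiry until the softirq finally runs. */
		return local_softirq_pending() & BIT(TIMER_SOFTIRQ);
	}
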
2017-12-29 | timers: Reinitialize per cpu bases on hotplug | Thomas Gleixner | 2 files, -2/+17
The timer wheel bases are not (re)initialized on CPU hotplug. That leaves them with a potentially stale clk and next_expiry value, which can cause trouble when the CPU is plugged.

Add a prepare callback which forwards the clock, sets next_expiry to far in the future and resets the control flags to a known state. Set base->must_forward_clk so the first timer which is queued will try to forward the clock to the current jiffies.

Fixes: 500462a9de65 ("timers: Switch to a non-cascading wheel")
Reported-by: Paul E. McKenney <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Sebastian Siewior <[email protected]>
Cc: Anna-Maria Gleixner <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712272152200.2431@nanos

2017-12-29 | timers: Use deferrable base independent of base::nohz_active | Anna-Maria Gleixner | 1 file, -9/+7
During boot, and before base::nohz_active is set in the timer bases, deferrable timers are enqueued into the standard timer base. This works correctly as long as base::nohz_active is false.

Once base::nohz_active is set and a timer which was enqueued before that is accessed, the lock selector code chooses the lock of the deferrable base. This causes unlocked access to the standard base, and in case the timer is removed it does not clear the pending flag in the standard base bitmap, which causes get_next_timer_interrupt() to return bogus values.

To prevent that, the deferrable timers must be enqueued in the deferrable base, even when base::nohz_active is not set. Those deferrable timers also need to be expired unconditionally.

Fixes: 500462a9de65 ("timers: Switch to a non-cascading wheel")
Signed-off-by: Anna-Maria Gleixner <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sebastian Siewior <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Paul McKenney <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

2017-12-29 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net | David S. Miller | 3 files, -12/+16
net/ipv6/ip6_gre.c is a case of parallel adds.

include/trace/events/tcp.h is a little bit more tricky. The removal of in-trace-macro ifdefs in 'net' paralleled with moving show_tcp_state_name and friends over to include/trace/events/sock.h in 'net-next'.

Signed-off-by: David S. Miller <[email protected]>

2017-12-29 | genirq/msi, x86/vector: Prevent reservation mode for non maskable MSI | Thomas Gleixner | 1 file, -4/+33
The new reservation mode for interrupts assigns a dummy vector when the interrupt is allocated and assigns a real vector when the interrupt is requested. The reservation mode prevents vector pressure when devices with a large amount of queues/interrupts are initialized, but only a minimal subset of those queues/interrupts is actually used.

This mode has an issue with MSI interrupts which cannot be masked. If the driver is not careful or the hardware emits an interrupt before the device irq is requested by the driver, then the interrupt ends up on the dummy vector as a spurious interrupt which can cause malfunction of the device or in the worst case a lockup of the machine.

Change the logic for the reservation mode so that the early activation of MSI interrupts checks whether:

 - the device is a PCI/MSI device
 - the reservation mode of the underlying irqdomain is activated
 - PCI/MSI masking is globally enabled
 - the PCI/MSI device uses either MSI-X, which supports masking, or MSI with the maskbit supported.

If one of those conditions is false, then clear the reservation mode flag in the irq data of the interrupt and invoke irq_domain_activate_irq() with the reserve argument cleared. In the x86 vector code, clear the can_reserve flag in the vector allocation data so a subsequent free_irq() won't create the same situation again. The interrupt stays assigned to a real vector until pci_disable_msi() is invoked and all allocations are undone.

Fixes: 4900be83602b ("x86/vector/msi: Switch to global reservation mode")
Reported-by: Alexandru Chirvasitu <[email protected]>
Reported-by: Andy Shevchenko <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Alexandru Chirvasitu <[email protected]>
Tested-by: Andy Shevchenko <[email protected]>
Cc: Dou Liyang <[email protected]>
Cc: Pavel Machek <[email protected]>
Cc: Maciej W. Rozycki <[email protected]>
Cc: Mikael Pettersson <[email protected]>
Cc: Josh Poulson <[email protected]>
Cc: Mihai Costache <[email protected]>
Cc: Stephen Hemminger <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: [email protected]
Cc: Haiyang Zhang <[email protected]>
Cc: Dexuan Cui <[email protected]>
Cc: Simon Xiao <[email protected]>
Cc: Saeed Mahameed <[email protected]>
Cc: Jork Loeser <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: [email protected]
Cc: KY Srinivasan <[email protected]>
Cc: Alan Cox <[email protected]>
Cc: Sakari Ailus <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712291406420.1899@nanos
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712291409460.1899@nanos

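A sketch of how those four conditions could be tested in one helper (names and msi_desc fields are assumptions drawn from the description above):

	static bool msi_check_reservation_mode(struct irq_domain *domain,
					       struct msi_domain_info *info,
					       struct device *dev)
	{
		struct msi_desc *desc;

		if (domain->bus_token != DOMAIN_BUS_PCI_MSI)
			return false;			/* not a PCI/MSI device */
		if (!(info->flags & MSI_FLAG_MUST_REACTIVATE))
			return false;			/* no reservation mode */
		if (IS_ENABLED(CONFIG_PCI_MSI) && pci_msi_ignore_mask)
			return false;			/* masking globally disabled */

		/* MSI-X can always mask; plain MSI only if the maskbit is there. */
		desc = first_msi_entry(dev);
		return desc->msi_attrib.is_msix || desc->msi_attrib.maskbit;
	}
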
2017-12-29 | genirq/irqdomain: Rename early argument of irq_domain_activate_irq() | Thomas Gleixner | 2 files, -7/+8
The 'early' argument of irq_domain_activate_irq() is actually used to denote reservation mode. To avoid confusion, rename it before abuse happens.

No functional change.

Fixes: 72491643469a ("genirq/irqdomain: Update irq_domain_ops.activate() signature")
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Alexandru Chirvasitu <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Dou Liyang <[email protected]>
Cc: Pavel Machek <[email protected]>
Cc: Maciej W. Rozycki <[email protected]>
Cc: Mikael Pettersson <[email protected]>
Cc: Josh Poulson <[email protected]>
Cc: Mihai Costache <[email protected]>
Cc: Stephen Hemminger <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: [email protected]
Cc: Haiyang Zhang <[email protected]>
Cc: Dexuan Cui <[email protected]>
Cc: Simon Xiao <[email protected]>
Cc: Saeed Mahameed <[email protected]>
Cc: Jork Loeser <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: [email protected]
Cc: KY Srinivasan <[email protected]>
Cc: Alan Cox <[email protected]>
Cc: Sakari Ailus <[email protected]>
Cc: [email protected]

2017-12-29 | genirq: Introduce IRQD_CAN_RESERVE flag | Thomas Gleixner | 1 file, -0/+1
Add a new flag to mark interrupts which can use reservation mode. This is going to be used in subsequent patches to disable reservation mode for a certain class of MSI devices.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Alexandru Chirvasitu <[email protected]>
Tested-by: Andy Shevchenko <[email protected]>
Cc: Dou Liyang <[email protected]>
Cc: Pavel Machek <[email protected]>
Cc: Maciej W. Rozycki <[email protected]>
Cc: Mikael Pettersson <[email protected]>
Cc: Josh Poulson <[email protected]>
Cc: Mihai Costache <[email protected]>
Cc: Stephen Hemminger <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: [email protected]
Cc: Haiyang Zhang <[email protected]>
Cc: Dexuan Cui <[email protected]>
Cc: Simon Xiao <[email protected]>
Cc: Saeed Mahameed <[email protected]>
Cc: Jork Loeser <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: [email protected]
Cc: KY Srinivasan <[email protected]>
Cc: Alan Cox <[email protected]>
Cc: Sakari Ailus <[email protected]>
Cc: [email protected]

2017-12-29 | genirq/msi: Handle reactivation only on success | Thomas Gleixner | 1 file, -8/+27
When analyzing the fallout of the x86 vector allocation rework it turned out that the error handling in msi_domain_alloc_irqs() is broken.

If MSI_FLAG_MUST_REACTIVATE is set for an MSI domain then it clears the activation flag for a successfully initialized msi descriptor. If a subsequent initialization fails then the error handling code path does not deactivate the interrupt because the activation flag got cleared.

Move the clearing of the activation flag outside of the initialization loop so that an eventual failure can be cleaned up correctly.

Fixes: 22d0b12f3560 ("genirq/irqdomain: Add force reactivation flag to irq domains")
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Alexandru Chirvasitu <[email protected]>
Tested-by: Andy Shevchenko <[email protected]>
Cc: Dou Liyang <[email protected]>
Cc: Pavel Machek <[email protected]>
Cc: Maciej W. Rozycki <[email protected]>
Cc: Mikael Pettersson <[email protected]>
Cc: Josh Poulson <[email protected]>
Cc: Mihai Costache <[email protected]>
Cc: Stephen Hemminger <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: [email protected]
Cc: Haiyang Zhang <[email protected]>
Cc: Dexuan Cui <[email protected]>
Cc: Simon Xiao <[email protected]>
Cc: Saeed Mahameed <[email protected]>
Cc: Jork Loeser <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: [email protected]
Cc: KY Srinivasan <[email protected]>
Cc: Alan Cox <[email protected]>
Cc: Sakari Ailus <[email protected]>
Cc: [email protected]

2017-12-29 | Merge tag 'pm-4.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm | Linus Torvalds | 2 files, -1/+14
Pull power management fix from Rafael Wysocki:
"This fixes a schedutil cpufreq governor regression from the 4.14 cycle that may cause a CPU idleness check to return incorrect results in some cases, which leads to suboptimal decisions (Joel Fernandes)"

* tag 'pm-4.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: schedutil: Use idle_calls counter of the remote CPU

2017-12-29 | genirq: Guard handle_bad_irq log messages | Guenter Roeck | 1 file, -0/+5
An interrupt storm on a bad interrupt will cause the kernel log to be clogged.

  [   60.089234] ->handle_irq():  ffffffffbe2f803f,
  [   60.090455] 0xffffffffbf2af380
  [   60.090510] handle_bad_irq+0x0/0x2e5
  [   60.090522] ->irq_data.chip(): ffffffffbf2af380,
  [   60.090553] IRQ_NOPROBE set
  [   60.090584] ->handle_irq():  ffffffffbe2f803f,
  [   60.090590] handle_bad_irq+0x0/0x2e5
  [   60.090596] ->irq_data.chip(): ffffffffbf2af380,
  [   60.090602] 0xffffffffbf2af380
  [   60.090608] ->action():           (null)
  [   60.090779] handle_bad_irq+0x0/0x2e5

This was seen when running an upstream kernel on an Acer Chromebook R11. The system was unstable as a result.

Guard the log message with __printk_ratelimit to reduce the impact. This won't prevent the interrupt storm from happening, but at least the system remains stable.

Signed-off-by: Guenter Roeck <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Dmitry Torokhov <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Mika Westerberg <[email protected]>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=197953
Link: https://lkml.kernel.org/r/[email protected]

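A sketch of the guard (the __printk_ratelimit() helper is named in the message above; the exact call site in the bad-IRQ reporting path is an assumption):

	static void print_irq_desc(unsigned int irq, struct irq_desc *desc)
	{
		/* Rate-limit rather than suppress: an interrupt storm on a bad
		 * IRQ must not clog the log, but the problem should still show. */
		if (!__printk_ratelimit(__func__))
			return;

		printk(KERN_ERR "irq %d, desc: %p, depth: %d, count: %d, unhandled: %d\n",
		       irq, desc, desc->depth, desc->irq_count, desc->irqs_unhandled);
		/* ... remaining diagnostic printks ... */
	}
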