Age | Commit message (Collapse) | Author | Files | Lines |
|
BPF map offload follow similar path to program offload. At creation
time users may specify ifindex of the device on which they want to
create the map. Map will be validated by the kernel's
.map_alloc_check callback and device driver will be called for the
actual allocation. Map will have an empty set of operations
associated with it (save for alloc and free callbacks). The real
device callbacks are kept in map->offload->dev_ops because they
have slightly different signatures. Map operations are called in
process context so the driver may communicate with HW freely,
msleep(), wait() etc.
Map alloc and free callbacks are muxed via existing .ndo_bpf, and
are always called with rtnl lock held. Maps and programs are
guaranteed to be destroyed before .ndo_uninit (i.e. before
unregister_netdev() returns). Map callbacks are invoked with
bpf_devs_lock *read* locked, drivers must take care of exclusive
locking if necessary.
All offload-specific branches are marked with unlikely() (through
bpf_map_is_dev_bound()), given that branch penalty will be
negligible compared to IO anyway, and we don't want to penalize
SW path unnecessarily.
Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
Add a helper to check if netdev could be found and whether it
has .ndo_bpf callback. There is no need to check the callback
every time it's invoked, ndos can't reasonably be swapped for
a set without .ndp_bpf while program is loaded.
bpf_dev_offload_check() will also be used by map offload.
Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
With map offload coming, we need to call program offload structure
something less ambiguous. Pure rename, no functional changes.
Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
All map types reimplement the field-by-field copy of union bpf_attr
members into struct bpf_map. Add a helper to perform this operation.
Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
Use the new callback to perform allocation checks for hash maps.
Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
Number of attribute checks are currently performed after hashtab
is already allocated. Move them to be able to split them out to
the check function later on. Checks have to now be performed on
the attr union directly instead of the members of bpf_map, since
bpf_map will be allocated later. No functional changes.
Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
.map_alloc callbacks contain a number of checks validating user-
-provided map attributes against constraints of a particular map
type. For offloaded maps we will need to check map attributes
without actually allocating any memory on the host. Add a new
callback for validating attributes before any memory is allocated.
This callback can be selectively implemented by map types for
sharing code with offloads, or simply to separate the logical
steps of validation and allocation.
Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
Support in-kernel fault-injection framework via debugfs.
This allows you to inject a conditional error to specified
function using debugfs interfaces.
Here is the result of test script described in
Documentation/fault-injection/fault-injection.txt
===========
# ./test_fail_function.sh
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0227404 s, 46.1 MB/s
btrfs-progs v4.4
See http://btrfs.wiki.kernel.org for more information.
Label: (null)
UUID: bfa96010-12e9-4360-aed0-42eec7af5798
Node size: 16384
Sector size: 4096
Filesystem size: 1001.00MiB
Block group profiles:
Data: single 8.00MiB
Metadata: DUP 58.00MiB
System: DUP 12.00MiB
SSD detected: no
Incompat features: extref, skinny-metadata
Number of devices: 1
Devices:
ID SIZE PATH
1 1001.00MiB /dev/loop2
mount: mount /dev/loop2 on /opt/tmpmnt failed: Cannot allocate memory
SUCCESS!
===========
Signed-off-by: Masami Hiramatsu <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Since error-injection framework is not limited to be used
by kprobes, nor bpf. Other kernel subsystems can use it
freely for checking safeness of error-injection, e.g.
livepatch, ftrace etc.
So this separate error-injection framework from kprobes.
Some differences has been made:
- "kprobe" word is removed from any APIs/structures.
- BPF_ALLOW_ERROR_INJECTION() is renamed to
ALLOW_ERROR_INJECTION() since it is not limited for BPF too.
- CONFIG_FUNCTION_ERROR_INJECTION is the config item of this
feature. It is automatically enabled if the arch supports
error injection feature for kprobe or ftrace etc.
Signed-off-by: Masami Hiramatsu <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Compare instruction pointer with original one on the
stack instead using per-cpu bpf_kprobe_override flag.
This patch also consolidates reset_current_kprobe() and
preempt_enable_no_resched() blocks. Those can be done
in one place.
Signed-off-by: Masami Hiramatsu <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Check whether error injectable event is on function entry or not.
Currently it checks the event is ftrace-based kprobes or not,
but that is wrong. It should check if the event is on the entry
of target function. Since error injection will override a function
to just return with modified return value, that operation must
be done before the target function starts making stackframe.
As a side effect, bpf error injection is no need to depend on
function-tracer. It can work with sw-breakpoint based kprobe
events too.
Signed-off-by: Masami Hiramatsu <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
BPF alignment tests got a conflict because the registers
are output as Rn_w instead of just Rn in net-next, and
in net a fixup for a testcase prohibits logical operations
on pointers before using them.
Also, we should attempt to patch BPF call args if JIT always on is
enabled. Instead, if we fail to JIT the subprogs we should pass
an error back up and fail immediately.
Signed-off-by: David S. Miller <[email protected]>
|
|
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-01-11
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Various BPF related improvements and fixes to nfp driver: i) do
not register XDP RXQ structure to control queues, ii) round up
program stack size to word size for nfp, iii) restrict MTU changes
when BPF offload is active, iv) add more fully featured relocation
support to JIT, v) add support for signed compare instructions to
the nfp JIT, vi) export and reuse verfier log routine for nfp, and
many more, from Jakub, Quentin and Nic.
2) Fix a syzkaller reported GPF in BPF's copy_verifier_state() when
we hit kmalloc failure path, from Alexei.
3) Add two follow-up fixes for the recent XDP RXQ series: i) kvzalloc()
allocated memory was only kfree()'ed, and ii) fix a memory leak where
RX queue was not freed in netif_free_rx_queues(), from Jakub.
4) Add a sample for transferring XDP meta data into the skb, here it
is used for setting skb->mark with the buffer from XDP, from Jesper.
====================
Signed-off-by: David S. Miller <[email protected]>
|
|
Daniel Borkmann says:
====================
pull-request: bpf 2018-01-09
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Prevent out-of-bounds speculation in BPF maps by masking the
index after bounds checks in order to fix spectre v1, and
add an option BPF_JIT_ALWAYS_ON into Kconfig that allows for
removing the BPF interpreter from the kernel in favor of
JIT-only mode to make spectre v2 harder, from Alexei.
2) Remove false sharing of map refcount with max_entries which
was used in spectre v1, from Daniel.
3) Add a missing NULL psock check in sockmap in order to fix
a race, from John.
4) Fix test_align BPF selftest case since a recent change in
verifier rejects the bit-wise arithmetic on pointers
earlier but test_align update was missing, from Alexei.
====================
Signed-off-by: David S. Miller <[email protected]>
|
|
Rename the BPF verifier `verbose()` to `bpf_verifier_log_write()` and
export it, so that other components (in particular, drivers for BPF
offload) can reuse the user buffer log to dump error messages at
verification time.
Renaming `verbose()` was necessary in order to avoid a name so generic
to be exported to the global namespace. However to prevent too much pain
for backports, the calls to `verbose()` in the kernel BPF verifier were
not changed. Instead, use function aliasing to make `verbose` point to
`bpf_verifier_log_write`. Another solution could consist in making a
wrapper around `verbose()`, but since it is a variadic function, I don't
see a clean way without creating two identical wrappers, one for the
verifier and one to export.
Signed-off-by: Quentin Monnet <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
The BPF interpreter has been used as part of the spectre 2 attack CVE-2017-5715.
A quote from goolge project zero blog:
"At this point, it would normally be necessary to locate gadgets in
the host kernel code that can be used to actually leak data by reading
from an attacker-controlled location, shifting and masking the result
appropriately and then using the result of that as offset to an
attacker-controlled address for a load. But piecing gadgets together
and figuring out which ones work in a speculation context seems annoying.
So instead, we decided to use the eBPF interpreter, which is built into
the host kernel - while there is no legitimate way to invoke it from inside
a VM, the presence of the code in the host kernel's text section is sufficient
to make it usable for the attack, just like with ordinary ROP gadgets."
To make attacker job harder introduce BPF_JIT_ALWAYS_ON config
option that removes interpreter from the kernel in favor of JIT-only mode.
So far eBPF JIT is supported by:
x64, arm64, arm32, sparc64, s390, powerpc64, mips64
The start of JITed program is randomized and code page is marked as read-only.
In addition "constant blinding" can be turned on with net.core.bpf_jit_harden
v2->v3:
- move __bpf_prog_ret0 under ifdef (Daniel)
v1->v2:
- fix init order, test_bpf and cBPF (Daniel's feedback)
- fix offloaded bpf (Jakub's feedback)
- add 'return 0' dummy in case something can invoke prog->bpf_func
- retarget bpf tree. For bpf-next the patch would need one extra hunk.
It will be sent when the trees are merged back to net-next
Considered doing:
int bpf_jit_enable __read_mostly = BPF_EBPF_JIT_DEFAULT;
but it seems better to land the patch as-is and in bpf-next remove
bpf_jit_enable global variable from all JITs, consolidate in one place
and remove this jit_init() function.
Signed-off-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
|
|
Under speculation, CPUs may mis-predict branches in bounds checks. Thus,
memory accesses under a bounds check may be speculated even if the
bounds check fails, providing a primitive for building a side channel.
To avoid leaking kernel data round up array-based maps and mask the index
after bounds check, so speculated load with out of bounds index will load
either valid value from the array or zero from the padded area.
Unconditionally mask index for all array types even when max_entries
are not rounded to power of 2 for root user.
When map is created by unpriv user generate a sequence of bpf insns
that includes AND operation to make sure that JITed code includes
the same 'index & index_mask' operation.
If prog_array map is created by unpriv user replace
bpf_tail_call(ctx, map, index);
with
if (index >= max_entries) {
index &= map->index_mask;
bpf_tail_call(ctx, map, index);
}
(along with roundup to power 2) to prevent out-of-bounds speculation.
There is secondary redundant 'if (index >= max_entries)' in the interpreter
and in all JITs, but they can be optimized later if necessary.
Other array-like maps (cpumap, devmap, sockmap, perf_event_array, cgroup_array)
cannot be used by unpriv, so no changes there.
That fixes bpf side of "Variant 1: bounds check bypass (CVE-2017-5753)" on
all architectures with and without JIT.
v2->v3:
Daniel noticed that attack potentially can be crafted via syscall commands
without loading the program, so add masking to those paths as well.
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: John Fastabend <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
"This contains fixes for the following two non-trivial issues:
- The task iterator got broken while adding thread mode support for
v4.14. It was less visible because it only triggers when both
cgroup1 and cgroup2 hierarchies are in use. The recent versions of
systemd uses cgroup2 for process management even when cgroup1 is
used for resource control exposing this issue.
- cpuset CPU hotplug path could deadlock when racing against exits.
There also are two patches to replace unlimited strcpy() usages with
strlcpy()"
* 'for-4.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: fix css_task_iter crash on CSS_TASK_ITER_PROC
cgroup: Fix deadlock in cpu hotplug path
cgroup: use strlcpy() instead of strscpy() to avoid spurious warning
cgroup: avoid copying strings longer than the buffers
|
|
syzbot reported the following panic in the verifier triggered
by kmalloc error injection:
kasan: GPF could be caused by NULL-ptr deref or user memory access
RIP: 0010:copy_func_state kernel/bpf/verifier.c:403 [inline]
RIP: 0010:copy_verifier_state+0x364/0x590 kernel/bpf/verifier.c:431
Call Trace:
pop_stack+0x8c/0x270 kernel/bpf/verifier.c:449
push_stack kernel/bpf/verifier.c:491 [inline]
check_cond_jmp_op kernel/bpf/verifier.c:3598 [inline]
do_check+0x4b60/0xa050 kernel/bpf/verifier.c:4731
bpf_check+0x3296/0x58c0 kernel/bpf/verifier.c:5489
bpf_prog_load+0xa2a/0x1b00 kernel/bpf/syscall.c:1198
SYSC_bpf kernel/bpf/syscall.c:1807 [inline]
SyS_bpf+0x1044/0x4420 kernel/bpf/syscall.c:1769
when copy_verifier_state() aborts in the middle due to kmalloc failure
some of the frames could have been partially copied while
current free_verifier_state() loop
for (i = 0; i <= state->curframe; i++)
assumed that all frames are non-null.
Simply fix it by adding 'if (!state)' to free_func_state().
Also avoid stressing copy frame logic more if kzalloc fails
in push_stack() free env->cur_state right away.
Fixes: f4d7e40a5b71 ("bpf: introduce function calls (verification)")
Reported-by: [email protected]
Reported-by: [email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
- untangle sys_close() abuses in xt_bpf
- deal with register_shrinker() failures in sget()
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix "netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'"
sget(): handle failures of register_shrinker()
mm,vmscan: Make unregister_shrinker() no-op if register_shrinker() failed.
|
|
Add psock NULL check to handle a racing sock event that can get the
sk_callback_lock before this case but after xchg happens causing the
refcnt to hit zero and sock user data (psock) to be null and queued
for garbage collection.
Also add a comment in the code because this is a bit subtle and
not obvious in my opinion.
Signed-off-by: John Fastabend <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
Currently, bpf syscall command BPF_MAP_GET_NEXT_KEY is not
supported for stacktrace map. However, there are use cases where
user space wants to enumerate all stacktrace map entries where
BPF_MAP_GET_NEXT_KEY command will be really helpful.
In addition, if user space wants to delete all map entries
in order to save memory and does not want to close the
map file descriptor, BPF_MAP_GET_NEXT_KEY may help improve
performance if map entries are sparsely populated.
The implementation has similar behavior for
BPF_MAP_GET_NEXT_KEY implementation in hashtab. If user provides
a NULL key pointer or an invalid key, the first key is returned.
Otherwise, the first valid key after the input parameter "key"
is returned, or -ENOENT if no valid key can be found.
Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
Descriptor table is a shared object; it's not a place where you can
stick temporary references to files, especially when we don't need
an opened file at all.
Cc: [email protected] # v4.14
Fixes: 98589a0998b8 ("netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'")
Signed-off-by: Al Viro <[email protected]>
|
|
gcc -fisolate-erroneous-paths-dereference can generate calls to abort()
from modular code too.
[[email protected]: drop duplicate exports of abort()]
Link: http://lkml.kernel.org/r/[email protected]
Reported-by: Vineet Gupta <[email protected]>
Cc: Sudip Mukherjee <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Alexey Brodkin <[email protected]>
Cc: Russell King <[email protected]>
Cc: Jose Abreu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
As Tsukada explains, the time_is_before_jiffies(acct->needcheck) check
is very wrong, we need time_is_after_jiffies() to make sys_acct() work.
Ignoring the overflows, the code should "goto out" if needcheck >
jiffies, while currently it checks "needcheck < jiffies" and thus in the
likely case check_free_space() does nothing until jiffies overflow.
In particular this means that sys_acct() is simply broken, acct_on()
sets acct->needcheck = jiffies and expects that check_free_space()
should set acct->active = 1 after the free-space check, but this won't
happen if jiffies increments in between.
This was broken by commit 32dc73086015 ("get rid of timer in
kern/acct.c") in 2011, then another (correct) commit 795a2f22a8ea
("acct() should honour the limits from the very beginning") made the
problem more visible.
Link: http://lkml.kernel.org/r/[email protected]
Fixes: 32dc73086015 ("get rid of timer in kern/acct.c")
Reported-by: TSUKADA Koutaro <[email protected]>
Suggested-by: TSUKADA Koutaro <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Al Viro <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The sockmap infrastructure is only aware of TCP sockets at the
moment. In the future we plan to add UDP. In both cases CONFIG_NET
should be built-in.
So lets only build sockmap if CONFIG_INET is enabled.
Signed-off-by: John Fastabend <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
This was added for some work that was eventually factored out but the
helper call was missed. Remove it now and add it back later if needed.
Signed-off-by: John Fastabend <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull pid allocation bug fix from Eric Biederman:
"The replacement of the pid hash table and the pid bitmap with an idr
resulted in an implementation that now fails more often in low memory
situations. Allowing fuzzers to observe bad behavior from a memory
allocation failure during pid allocation.
This is a small change to fix this by making the kernel more robust in
the case of error. The non-error paths are left alone so the only
danger is to the already broken error path. I have manually injected
errors and verified that this new error handling works"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
pid: Handle failure to allocate the first pid in a pid namespace
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
"A pile of fixes for long standing issues with the timer wheel and the
NOHZ code:
- Prevent timer base confusion accross the nohz switch, which can
cause unlocked access and data corruption
- Reinitialize the stale base clock on cpu hotplug to prevent subtle
side effects including rollovers on 32bit
- Prevent an interrupt storm when the timer softirq is already
pending caused by tick_nohz_stop_sched_tick()
- Move the timer start tracepoint to a place where it actually makes
sense
- Add documentation to timerqueue functions as they caused confusion
several times now"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timerqueue: Document return values of timerqueue_add/del()
timers: Invoke timer_start_debug() where it makes sense
nohz: Prevent a timer interrupt storm in tick_nohz_stop_sched_tick()
timers: Reinitialize per cpu bases on hotplug
timers: Use deferrable base independent of base::nohz_active
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull smp fixlet from Thomas Gleixner:
"A trivial build warning fix for newer compilers"
* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Move inline keyword at the beginning of declaration
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
"Three patches addressing the fallout of the CPU_ISOLATION changes
especially with NO_HZ_FULL plus documentation of boot parameter
dependency"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/isolation: Document boot parameters dependency on CONFIG_CPU_ISOLATION=y
sched/isolation: Enable CONFIG_CPU_ISOLATION=y by default
sched/isolation: Make CONFIG_NO_HZ_FULL select CONFIG_CPU_ISOLATION
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
"A rather large update after the kaisered maintainer finally found time
to handle regression reports.
- The larger part addresses a regression caused by the x86 vector
management rework.
The reservation based model does not work reliably for MSI
interrupts, if they cannot be masked (yes, yet another hw
engineering trainwreck). The reason is that the reservation mode
assigns a dummy vector when the interrupt is allocated and switches
to a real vector when the interrupt is requested.
If the MSI entry cannot be masked then the initialization might
raise an interrupt before the interrupt is requested, which ends up
as spurious interrupt and causes device malfunction and worse. The
fix is to exclude MSI interrupts which do not support masking from
reservation mode and assign a real vector right away.
- Extend the extra lockdep class setup for nested interrupts with a
class for the recently added irq_desc::request_mutex so lockdep can
differeniate and does not emit false positive warnings.
- A ratelimit guard for the bad irq printout so in case a bad irq
comes back immediately the system does not drown in dmesg spam"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/msi, x86/vector: Prevent reservation mode for non maskable MSI
genirq/irqdomain: Rename early argument of irq_domain_activate_irq()
x86/vector: Use IRQD_CAN_RESERVE flag
genirq: Introduce IRQD_CAN_RESERVE flag
genirq/msi: Handle reactivation only on success
gpio: brcmstb: Make really use of the new lockdep class
genirq: Guard handle_bad_irq log messages
kernel/irq: Extend lockdep class for request mutex
|
|
Report to the user ifindex and namespace information of offloaded
programs. If device has disappeared return -ENODEV. Specify the
namespace using dev/inode combination.
CC: Eric W. Biederman <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
Bound programs are quite useless after their device disappears.
They are simply waiting for reference count to go to zero,
don't list them in BPF_PROG_GET_NEXT_ID by freeing their ID
early.
Note that orphaned offload programs will return -ENODEV on
BPF_OBJ_GET_INFO_BY_FD so user will never see ID 0.
Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
All bpf offload operations should now be under bpf_devs_lock,
it's safe to free and clear the entire offload structure,
not only the netdev pointer.
__bpf_prog_offload_destroy() will no longer be called multiple
times.
Suggested-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
To allow verifier instruction callbacks without any extra locking
NETDEV_UNREGISTER notification would wait on a waitqueue for verifier
to finish. This design decision was made when rtnl lock was providing
all the locking. Use the read/write lock instead and remove the
workqueue.
Verifier will now call into the offload code, so dev_ops are moved
to offload structure. Since verifier calls are all under
bpf_prog_is_dev_bound() we no longer need static inline implementations
to please builds with CONFIG_NET=n.
Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
We currently use aux->offload to indicate that program is bound
to a specific device. This forces us to keep the offload structure
around even after the device is gone. Add a bool member to
struct bpf_prog_aux to indicate if offload was requested.
Suggested-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
We don't need the RTNL lock for all operations on offload state.
We only need to hold it around ndo calls. The device offload
initialization doesn't require it. The soon-to-come querying
of the offload info will only need it partially. We will also
be able to remove the waitqueue in following patches.
Use struct rw_semaphore because map offload will require sleeping
with the semaphore held for read.
Suggested-by: Kirill Tkhai <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
The timer start debug function is called before the proper timer base is
set. As a consequence the trace data contains the stale CPU and flags
values.
Call the debug function after setting the new base and flags.
Fixes: 500462a9de65 ("timers: Switch to a non-cascading wheel")
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Sebastian Siewior <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Paul McKenney <[email protected]>
Cc: Anna-Maria Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
The conditions in irq_exit() to invoke tick_nohz_irq_exit() which
subsequently invokes tick_nohz_stop_sched_tick() are:
if ((idle_cpu(cpu) && !need_resched()) || tick_nohz_full_cpu(cpu))
If need_resched() is not set, but a timer softirq is pending then this is
an indication that the softirq code punted and delegated the execution to
softirqd. need_resched() is not true because the current interrupted task
takes precedence over softirqd.
Invoking tick_nohz_irq_exit() in this case can cause an endless loop of
timer interrupts because the timer wheel contains an expired timer, but
softirqs are not yet executed. So it returns an immediate expiry request,
which causes the timer to fire immediately again. Lather, rinse and
repeat....
Prevent that by adding a check for a pending timer soft interrupt to the
conditions in tick_nohz_stop_sched_tick() which avoid calling
get_next_timer_interrupt(). That keeps the tick sched timer on the tick and
prevents a repetitive programming of an already expired timer.
Reported-by: Sebastian Siewior <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Frederic Weisbecker <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: Anna-Maria Gleixner <[email protected]>
Cc: Sebastian Siewior <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712272156050.2431@nanos
|
|
The timer wheel bases are not (re)initialized on CPU hotplug. That leaves
them with a potentially stale clk and next_expiry valuem, which can cause
trouble then the CPU is plugged.
Add a prepare callback which forwards the clock, sets next_expiry to far in
the future and reset the control flags to a known state.
Set base->must_forward_clk so the first timer which is queued will try to
forward the clock to current jiffies.
Fixes: 500462a9de65 ("timers: Switch to a non-cascading wheel")
Reported-by: Paul E. McKenney <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Sebastian Siewior <[email protected]>
Cc: Anna-Maria Gleixner <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712272152200.2431@nanos
|
|
During boot and before base::nohz_active is set in the timer bases, deferrable
timers are enqueued into the standard timer base. This works correctly as
long as base::nohz_active is false.
Once it base::nohz_active is set and a timer which was enqueued before that
is accessed the lock selector code choses the lock of the deferred
base. This causes unlocked access to the standard base and in case the
timer is removed it does not clear the pending flag in the standard base
bitmap which causes get_next_timer_interrupt() to return bogus values.
To prevent that, the deferrable timers must be enqueued in the deferrable
base, even when base::nohz_active is not set. Those deferrable timers also
need to be expired unconditional.
Fixes: 500462a9de65 ("timers: Switch to a non-cascading wheel")
Signed-off-by: Anna-Maria Gleixner <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sebastian Siewior <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Paul McKenney <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
net/ipv6/ip6_gre.c is a case of parallel adds.
include/trace/events/tcp.h is a little bit more tricky. The removal
of in-trace-macro ifdefs in 'net' paralleled with moving
show_tcp_state_name and friends over to include/trace/events/sock.h
in 'net-next'.
Signed-off-by: David S. Miller <[email protected]>
|
|
The new reservation mode for interrupts assigns a dummy vector when the
interrupt is allocated and assigns a real vector when the interrupt is
requested. The reservation mode prevents vector pressure when devices with
a large amount of queues/interrupts are initialized, but only a minimal
subset of those queues/interrupts is actually used.
This mode has an issue with MSI interrupts which cannot be masked. If the
driver is not careful or the hardware emits an interrupt before the device
irq is requestd by the driver then the interrupt ends up on the dummy
vector as a spurious interrupt which can cause malfunction of the device or
in the worst case a lockup of the machine.
Change the logic for the reservation mode so that the early activation of
MSI interrupts checks whether:
- the device is a PCI/MSI device
- the reservation mode of the underlying irqdomain is activated
- PCI/MSI masking is globally enabled
- the PCI/MSI device uses either MSI-X, which supports masking, or
MSI with the maskbit supported.
If one of those conditions is false, then clear the reservation mode flag
in the irq data of the interrupt and invoke irq_domain_activate_irq() with
the reserve argument cleared. In the x86 vector code, clear the can_reserve
flag in the vector allocation data so a subsequent free_irq() won't create
the same situation again. The interrupt stays assigned to a real vector
until pci_disable_msi() is invoked and all allocations are undone.
Fixes: 4900be83602b ("x86/vector/msi: Switch to global reservation mode")
Reported-by: Alexandru Chirvasitu <[email protected]>
Reported-by: Andy Shevchenko <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Alexandru Chirvasitu <[email protected]>
Tested-by: Andy Shevchenko <[email protected]>
Cc: Dou Liyang <[email protected]>
Cc: Pavel Machek <[email protected]>
Cc: Maciej W. Rozycki <[email protected]>
Cc: Mikael Pettersson <[email protected]>
Cc: Josh Poulson <[email protected]>
Cc: Mihai Costache <[email protected]>
Cc: Stephen Hemminger <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: [email protected]
Cc: Haiyang Zhang <[email protected]>
Cc: Dexuan Cui <[email protected]>
Cc: Simon Xiao <[email protected]>
Cc: Saeed Mahameed <[email protected]>
Cc: Jork Loeser <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: [email protected]
Cc: KY Srinivasan <[email protected]>
Cc: Alan Cox <[email protected]>
Cc: Sakari Ailus <[email protected]>,
Cc: [email protected]
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712291406420.1899@nanos
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712291409460.1899@nanos
|
|
The 'early' argument of irq_domain_activate_irq() is actually used to
denote reservation mode. To avoid confusion, rename it before abuse
happens.
No functional change.
Fixes: 72491643469a ("genirq/irqdomain: Update irq_domain_ops.activate() signature")
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Alexandru Chirvasitu <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Dou Liyang <[email protected]>
Cc: Pavel Machek <[email protected]>
Cc: Maciej W. Rozycki <[email protected]>
Cc: Mikael Pettersson <[email protected]>
Cc: Josh Poulson <[email protected]>
Cc: Mihai Costache <[email protected]>
Cc: Stephen Hemminger <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: [email protected]
Cc: Haiyang Zhang <[email protected]>
Cc: Dexuan Cui <[email protected]>
Cc: Simon Xiao <[email protected]>
Cc: Saeed Mahameed <[email protected]>
Cc: Jork Loeser <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: [email protected]
Cc: KY Srinivasan <[email protected]>
Cc: Alan Cox <[email protected]>
Cc: Sakari Ailus <[email protected]>,
Cc: [email protected]
|
|
Add a new flag to mark interrupts which can use reservation mode. This is
going to be used in subsequent patches to disable reservation mode for a
certain class of MSI devices.
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Alexandru Chirvasitu <[email protected]>
Tested-by: Andy Shevchenko <[email protected]>
Cc: Dou Liyang <[email protected]>
Cc: Pavel Machek <[email protected]>
Cc: Maciej W. Rozycki <[email protected]>
Cc: Mikael Pettersson <[email protected]>
Cc: Josh Poulson <[email protected]>
Cc: Mihai Costache <[email protected]>
Cc: Stephen Hemminger <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: [email protected]
Cc: Haiyang Zhang <[email protected]>
Cc: Dexuan Cui <[email protected]>
Cc: Simon Xiao <[email protected]>
Cc: Saeed Mahameed <[email protected]>
Cc: Jork Loeser <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: [email protected]
Cc: KY Srinivasan <[email protected]>
Cc: Alan Cox <[email protected]>
Cc: Sakari Ailus <[email protected]>,
Cc: [email protected]
|
|
When analyzing the fallout of the x86 vector allocation rework it turned
out that the error handling in msi_domain_alloc_irqs() is broken.
If MSI_FLAG_MUST_REACTIVATE is set for a MSI domain then it clears the
activation flag for a successfully initialized msi descriptor. If a
subsequent initialization fails then the error handling code path does not
deactivate the interrupt because the activation flag got cleared.
Move the clearing of the activation flag outside of the initialization loop
so that an eventual failure can be cleaned up correctly.
Fixes: 22d0b12f3560 ("genirq/irqdomain: Add force reactivation flag to irq domains")
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Alexandru Chirvasitu <[email protected]>
Tested-by: Andy Shevchenko <[email protected]>
Cc: Dou Liyang <[email protected]>
Cc: Pavel Machek <[email protected]>
Cc: Maciej W. Rozycki <[email protected]>
Cc: Mikael Pettersson <[email protected]>
Cc: Josh Poulson <[email protected]>
Cc: Mihai Costache <[email protected]>
Cc: Stephen Hemminger <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: [email protected]
Cc: Haiyang Zhang <[email protected]>
Cc: Dexuan Cui <[email protected]>
Cc: Simon Xiao <[email protected]>
Cc: Saeed Mahameed <[email protected]>
Cc: Jork Loeser <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: [email protected]
Cc: KY Srinivasan <[email protected]>
Cc: Alan Cox <[email protected]>
Cc: Sakari Ailus <[email protected]>,
Cc: [email protected]
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fix from Rafael Wysocki:
"This fixes a schedutil cpufreq governor regression from the 4.14 cycle
that may cause a CPU idleness check to return incorrect results in
some cases which leads to suboptimal decisions (Joel Fernandes)"
* tag 'pm-4.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: schedutil: Use idle_calls counter of the remote CPU
|
|
An interrupt storm on a bad interrupt will cause the kernel
log to be clogged.
[ 60.089234] ->handle_irq(): ffffffffbe2f803f,
[ 60.090455] 0xffffffffbf2af380
[ 60.090510] handle_bad_irq+0x0/0x2e5
[ 60.090522] ->irq_data.chip(): ffffffffbf2af380,
[ 60.090553] IRQ_NOPROBE set
[ 60.090584] ->handle_irq(): ffffffffbe2f803f,
[ 60.090590] handle_bad_irq+0x0/0x2e5
[ 60.090596] ->irq_data.chip(): ffffffffbf2af380,
[ 60.090602] 0xffffffffbf2af380
[ 60.090608] ->action(): (null)
[ 60.090779] handle_bad_irq+0x0/0x2e5
This was seen when running an upstream kernel on Acer Chromebook R11. The
system was unstable as result.
Guard the log message with __printk_ratelimit to reduce the impact. This
won't prevent the interrupt storm from happening, but at least the system
remains stable.
Signed-off-by: Guenter Roeck <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Dmitry Torokhov <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Mika Westerberg <[email protected]>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=197953
Link: https://lkml.kernel.org/r/[email protected]
|