Age | Commit message (Collapse) | Author | Files | Lines |
|
Refactor the codes that populate the value
of a htab_elem in a BPF_MAP_TYPE_PERCPU_HASH
typed bpf_map.
Signed-off-by: Martin KaFai Lau <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Instead of having a common LRU list, this patch allows a
percpu LRU list which can be selected by specifying a map
attribute. The map attribute will be added in the later
patch.
While the common use case for LRU is #reads >> #updates,
percpu LRU list allows bpf prog to absorb unusual #updates
under pathological case (e.g. external traffic facing machine which
could be under attack).
Each percpu LRU is isolated from each other. The LRU nodes (including
free nodes) cannot be moved across different LRU Lists.
Here are the update performance comparison between
common LRU list and percpu LRU list (the test code is
at the last patch):
[[email protected] ~]# for i in 1 4 8; do echo -n "$i cpus: "; \
./map_perf_test 16 $i | awk '{r += $3}END{print r " updates"}'; done
1 cpus: 2934082 updates
4 cpus: 7391434 updates
8 cpus: 6500576 updates
[[email protected] ~]# for i in 1 4 8; do echo -n "$i cpus: "; \
./map_perf_test 32 $i | awk '{r += $3}END{printr " updates"}'; done
1 cpus: 2896553 updates
4 cpus: 9766395 updates
8 cpus: 17460553 updates
Signed-off-by: Martin KaFai Lau <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Introduce bpf_lru_list which will provide LRU capability to
the bpf_htab in the later patch.
* General Thoughts:
1. Target use case. Read is more often than update.
(i.e. bpf_lookup_elem() is more often than bpf_update_elem()).
If bpf_prog does a bpf_lookup_elem() first and then an in-place
update, it still counts as a read operation to the LRU list concern.
2. It may be useful to think of it as a LRU cache
3. Optimize the read case
3.1 No lock in read case
3.2 The LRU maintenance is only done during bpf_update_elem()
4. If there is a percpu LRU list, it will lose the system-wise LRU
property. A completely isolated percpu LRU list has the best
performance but the memory utilization is not ideal considering
the work load may be imbalance.
5. Hence, this patch starts the LRU implementation with a global LRU
list with batched operations before accessing the global LRU list.
As a LRU cache, #read >> #update/#insert operations, it will work well.
6. There is a local list (for each cpu) which is named
'struct bpf_lru_locallist'. This local list is not used to sort
the LRU property. Instead, the local list is to batch enough
operations before acquiring the lock of the global LRU list. More
details on this later.
7. In the later patch, it allows a percpu LRU list by specifying a
map-attribute for scalability reason and for use cases that need to
prepare for the worst (and pathological) case like DoS attack.
The percpu LRU list is completely isolated from each other and the
LRU nodes (including free nodes) cannot be moved across the list. The
following description is for the global LRU list but mostly applicable
to the percpu LRU list also.
* Global LRU List:
1. It has three sub-lists: active-list, inactive-list and free-list.
2. The two list idea, active and inactive, is borrowed from the
page cache.
3. All nodes are pre-allocated and all sit at the free-list (of the
global LRU list) at the beginning. The pre-allocation reasoning
is similar to the existing BPF_MAP_TYPE_HASH. However,
opting-out prealloc (BPF_F_NO_PREALLOC) is not supported in
the LRU map.
* Active/Inactive List (of the global LRU list):
1. The active list, as its name says it, maintains the active set of
the nodes. We can think of it as the working set or more frequently
accessed nodes. The access frequency is approximated by a ref-bit.
The ref-bit is set during the bpf_lookup_elem().
2. The inactive list, as its name also says it, maintains a less
active set of nodes. They are the candidates to be removed
from the bpf_htab when we are running out of free nodes.
3. The ordering of these two lists is acting as a rough clock.
The tail of the inactive list is the older nodes and
should be released first if the bpf_htab needs free element.
* Rotating the Active/Inactive List (of the global LRU list):
1. It is the basic operation to maintain the LRU property of
the global list.
2. The active list is only rotated when the inactive list is running
low. This idea is similar to the current page cache.
Inactive running low is currently defined as
"# of inactive < # of active".
3. The active list rotation always starts from the tail. It moves
node without ref-bit set to the head of the inactive list.
It moves node with ref-bit set back to the head of the active
list and then clears its ref-bit.
4. The inactive rotation is pretty simply.
It walks the inactive list and moves the nodes back to the head of
active list if its ref-bit is set. The ref-bit is cleared after moving
to the active list.
If the node does not have ref-bit set, it just leave it as it is
because it is already in the inactive list.
* Shrinking the Inactive List (of the global LRU list):
1. Shrinking is the operation to get free nodes when the bpf_htab is
full.
2. It usually only shrinks the inactive list to get free nodes.
3. During shrinking, it will walk the inactive list from the tail,
delete the nodes without ref-bit set from bpf_htab.
4. If no free node found after step (3), it will forcefully get
one node from the tail of inactive or active list. Forcefully is
in the sense that it ignores the ref-bit.
* Local List:
1. Each CPU has a 'struct bpf_lru_locallist'. The purpose is to
batch enough operations before acquiring the lock of the
global LRU.
2. A local list has two sub-lists, free-list and pending-list.
3. During bpf_update_elem(), it will try to get from the free-list
of (the current CPU local list).
4. If the local free-list is empty, it will acquire from the
global LRU list. The global LRU list can either satisfy it
by its global free-list or by shrinking the global inactive
list. Since we have acquired the global LRU list lock,
it will try to get at most LOCAL_FREE_TARGET elements
to the local free list.
5. When a new element is added to the bpf_htab, it will
first sit at the pending-list (of the local list) first.
The pending-list will be flushed to the global LRU list
when it needs to acquire free nodes from the global list
next time.
* Lock Consideration:
The LRU list has a lock (lru_lock). Each bucket of htab has a
lock (buck_lock). If both locks need to be acquired together,
the lock order is always lru_lock -> buck_lock and this only
happens in the bpf_lru_list.c logic.
In hashtab.c, both locks are not acquired together (i.e. one
lock is always released first before acquiring another lock).
Signed-off-by: Martin KaFai Lau <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Alexei discovered a race condition in modules failing to load that can
cause a ftrace check to trigger and disable ftrace.
This is because of the way modules are registered to ftrace. Their
functions are loaded in the ftrace function tables but set to
"disabled" since they are still in the process of being loaded by the
module. After the module is finished, it calls back into the ftrace
infrastructure to enable it.
Looking deeper into the locations that access all the functions in the
table, I found more locations that should ignore the disabled ones"
* tag 'trace-v4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Add more checks for FTRACE_FL_DISABLED in processing ip records
ftrace: Ignore FTRACE_FL_DISABLED while walking dyn_ftrace records
|
|
Several cases of bug fixes in 'net' overlapping other changes in
'net-next-.
Signed-off-by: David S. Miller <[email protected]>
|
|
Commit:
db4a835601b7 ("perf/core: Set cgroup in CPU contexts for new cgroup events")
failed to verify that event->cgrp is actually the scheduled cgroup
in a CPU before setting cpuctx->cgrp. This patch fixes that.
Now that there is a different path for scheduled and unscheduled
cgroup, add a warning to catch when cpuctx->cgrp is still set after
the last cgroup event has been unsheduled.
To verify the bug:
# Create 2 cgroups.
mkdir /dev/cgroups/devices/g1
mkdir /dev/cgroups/devices/g2
# launch a task, bind it to a cpu and move it to g1
CPU=2
while :; do : ; done &
P=$!
taskset -pc $CPU $P
echo $P > /dev/cgroups/devices/g1/tasks
# monitor g2 (it runs no tasks) and observe output
perf stat -e cycles -I 1000 -C $CPU -G g2
# time counts unit events
1.000091408 7,579,527 cycles g2
2.000350111 <not counted> cycles g2
3.000589181 <not counted> cycles g2
4.000771428 <not counted> cycles g2
# note first line that displays that a task run in g2, despite
# g2 having no tasks. This is because cpuctx->cgrp was wrongly
# set when context of new event was installed.
# After applying the fix we obtain the right output:
perf stat -e cycles -I 1000 -C $CPU -G g2
# time counts unit events
1.000119615 <not counted> cycles g2
2.000389430 <not counted> cycles g2
3.000590962 <not counted> cycles g2
Signed-off-by: David Carrillo-Cisneros <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Nilay Vaish <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vegard Nossum <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Now since fetch_task_cputime() has no other users than task_cputime(),
its code could be used directly in task_cputime().
Moreover since only 2 task_cputime() calls of 17 use a NULL argument,
we can add dummy variables to those calls and remove NULL checks from
task_cputimes().
Also remove NULL checks from task_cputimes_scaled().
Signed-off-by: Stanislaw Gruszka <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Michael Neuling <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Only s390 and powerpc have hardware facilities allowing to measure
cputimes scaled by frequency. On all other architectures
utimescaled/stimescaled are equal to utime/stime (however they are
accounted separately).
Remove {u,s}timescaled accounting on all architectures except
powerpc and s390, where those values are explicitly accounted
in the proper places.
Signed-off-by: Stanislaw Gruszka <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Michael Neuling <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Currently cputime_to_scaled() just return it's argument on
all implementations, we don't need to call this function.
Signed-off-by: Stanislaw Gruszka <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Reviewed-by: Paul Mackerras <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Michael Neuling <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Pull networking fixes from David Miller:
1) Fix off by one wrt. indexing when dumping /proc/net/route entries,
from Alexander Duyck.
2) Fix lockdep splats in iwlwifi, from Johannes Berg.
3) Cure panic when inserting certain netfilter rules when NFT_SET_HASH
is disabled, from Liping Zhang.
4) Memory leak when nft_expr_clone() fails, also from Liping Zhang.
5) Disable UFO when path will apply IPSEC tranformations, from Jakub
Sitnicki.
6) Don't bogusly double cwnd in dctcp module, from Florian Westphal.
7) skb_checksum_help() should never actually use the value "0" for the
resulting checksum, that has a special meaning, use CSUM_MANGLED_0
instead. From Eric Dumazet.
8) Per-tx/rx queue statistic strings are wrong in qed driver, fix from
Yuval MIntz.
9) Fix SCTP reference counting of associations and transports in
sctp_diag. From Xin Long.
10) When we hit ip6tunnel_xmit() we could have come from an ipv4 path in
a previous layer or similar, so explicitly clear the ipv6 control
block in the skb. From Eli Cooper.
11) Fix bogus sleeping inside of inet_wait_for_connect(), from WANG
Cong.
12) Correct deivce ID of T6 adapter in cxgb4 driver, from Hariprasad
Shenai.
13) Fix potential access past the end of the skb page frag array in
tcp_sendmsg(). From Eric Dumazet.
14) 'skb' can legitimately be NULL in inet{,6}_exact_dif_match(). Fix
from David Ahern.
15) Don't return an error in tcp_sendmsg() if we wronte any bytes
successfully, from Eric Dumazet.
16) Extraneous unlocks in netlink_diag_dump(), we removed the locking
but forgot to purge these unlock calls. From Eric Dumazet.
17) Fix memory leak in error path of __genl_register_family(). We leak
the attrbuf, from WANG Cong.
18) cgroupstats netlink policy table is mis-sized, from WANG Cong.
19) Several XDP bug fixes in mlx5, from Saeed Mahameed.
20) Fix several device refcount leaks in network drivers, from Johan
Hovold.
21) icmp6_send() should use skb dst device not skb->dev to determine L3
routing domain. From David Ahern.
22) ip_vs_genl_family sets maxattr incorrectly, from WANG Cong.
23) We leak new macvlan port in some cases of maclan_common_netlink()
errors. Fix from Gao Feng.
24) Similar to the icmp6_send() fix, icmp_route_lookup() should
determine L3 routing domain using skb_dst(skb)->dev not skb->dev.
Also from David Ahern.
25) Several fixes for route offloading and FIB notification handling in
mlxsw driver, from Jiri Pirko.
26) Properly cap __skb_flow_dissect()'s return value, from Eric Dumazet.
27) Fix long standing regression in ipv4 redirect handling, wrt.
validating the new neighbour's reachability. From Stephen Suryaputra
Lin.
28) If sk_filter() trims the packet excessively, handle it reasonably in
tcp input instead of exploding. From Eric Dumazet.
29) Fix handling of napi hash state when copying channels in sfc driver,
from Bert Kenward.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (121 commits)
mlxsw: spectrum_router: Flush FIB tables during fini
net: stmmac: Fix lack of link transition for fixed PHYs
sctp: change sk state only when it has assocs in sctp_shutdown
bnx2: Wait for in-flight DMA to complete at probe stage
Revert "bnx2: Reset device during driver initialization"
ps3_gelic: fix spelling mistake in debug message
net: ethernet: ixp4xx_eth: fix spelling mistake in debug message
ibmvnic: Fix size of debugfs name buffer
ibmvnic: Unmap ibmvnic_statistics structure
sfc: clear napi_hash state when copying channels
mlxsw: spectrum_router: Correctly dump neighbour activity
mlxsw: spectrum: Fix refcount bug on span entries
bnxt_en: Fix VF virtual link state.
bnxt_en: Fix ring arithmetic in bnxt_setup_tc().
Revert "include/uapi/linux/atm_zatm.h: include linux/time.h"
tcp: take care of truncations done by sk_filter()
ipv4: use new_gw for redirect neigh lookup
r8152: Fix error path in open function
net: bpqether.h: remove if_ether.h guard
net: __skb_flow_dissect() must cap its return value
...
|
|
Allow wakeup_dl tracer to be used by instances, like wakeup tracer
and wakeup_rt tracer.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Zhou Chengming <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
|
|
The trace_events_file.c filter logic can be a bit complex. I copy this into
a userspace program where I can debug it a bit easier. One issue is the op
is defined in most places as an int instead of as an enum, and gdb just
gives the value when debugging. Having the actual op name shown in gdb is
more useful.
This has no functionality change, but helps in debugging when the file is
debugged in user space.
Signed-off-by: Steven Rostedt <[email protected]>
|
|
Currently the filter logic for comparisons (like greater-than and less-than)
are used, they share the same function and a switch statement is used to
jump to the comparison type to perform. This is done in the extreme hot path
of the tracing code, and it does not take much more space to create a
unique comparison function to perform each type of comparison and remove the
switch statement.
Also, a bug was found where the binary and operation for 64 bits could fail
if the resulting bits were greater than 32 bits, because the result was
passed into a 32 bit variable. This was fixed when adding the separate
binary and function.
Signed-off-by: Steven Rostedt <[email protected]>
|
|
Use glob_match() to support flexible glob wildcards (*,?)
and character classes ([) for ftrace.
Since the full glob matching is slower than the current
partial matching routines(*pat, pat*, *pat*), this leaves
those routines and just add MATCH_GLOB for complex glob
expression.
e.g.
----
[root@localhost tracing]# echo 'sched*group' > set_ftrace_filter
[root@localhost tracing]# cat set_ftrace_filter
sched_free_group
sched_change_group
sched_create_group
sched_online_group
sched_destroy_group
sched_offline_group
[root@localhost tracing]# echo '[Ss]y[Ss]_*' > set_ftrace_filter
[root@localhost tracing]# head set_ftrace_filter
sys_arch_prctl
sys_rt_sigreturn
sys_ioperm
SyS_iopl
sys_modify_ldt
SyS_mmap
SyS_set_thread_area
SyS_get_thread_area
SyS_set_tid_address
sys_fork
----
Link: http://lkml.kernel.org/r/147566869501.29136.6462645009894738056.stgit@devbox
Acked-by: Namhyung Kim <[email protected]>
Signed-off-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
|
|
When a module is first loaded and its function ip records are added to the
ftrace list of functions to modify, they are set to DISABLED, as their text
is still in a read only state. When the module is fully loaded, and can be
updated, the flag is cleared, and if their's any functions that should be
tracing them, it is updated at that moment.
But there's several locations that do record accounting and should ignore
records that are marked as disabled, or they can cause issues.
Alexei already fixed one location, but others need to be addressed.
Cc: [email protected]
Fixes: b7ffffbb46f2 "ftrace: Add infrastructure for delayed enabling of module functions"
Reported-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
|
|
ftrace_shutdown() checks for sanity of ftrace records
and if dyn_ftrace->flags is not zero, it will warn.
It can happen that 'flags' are set to FTRACE_FL_DISABLED at this point,
since some module was loaded, but before ftrace_module_enable()
cleared the flags for this module.
In other words the module.c is doing:
ftrace_module_init(mod); // calls ftrace_update_code() that sets flags=FTRACE_FL_DISABLED
... // here ftrace_shutdown() is called that warns, since
err = prepare_coming_module(mod); // didn't have a chance to clear FTRACE_FL_DISABLED
Fix it by ignoring disabled records.
It's similar to what __ftrace_hash_rec_update() is already doing.
Link: http://lkml.kernel.org/r/[email protected]
Cc: [email protected]
Fixes: b7ffffbb46f2 "ftrace: Add infrastructure for delayed enabling of module functions"
Signed-off-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
|
|
Replace the custom u64_to_ptr() function with the u64_to_user_ptr()
macro.
Signed-off-by: Mickaël Salaün <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Tame initialization warning of len_abuf in audit_log_execve_info even
though there isn't presently a bug introduced by commit 43761473c254
("audit: fix a double fetch in audit_log_single_execve_arg()"). Using
UNINITIALIZED_VAR instead may mask future bugs.
Signed-off-by: Richard Guy Briggs <[email protected]>
Signed-off-by: Paul Moore <[email protected]>
|
|
The current code can result in spurious kicks when there are no grace
periods in progress and no grace-period-related requests. This is
sort of OK for a diagnostic aid, but the resulting ftrace-dump messages
in dmesg are annoying. This commit therefore avoids spurious kicks
in the common case.
Signed-off-by: Paul E. McKenney <[email protected]>
Reviewed-by: Josh Triplett <[email protected]>
|
|
Expedited grace periods check dyntick-idle state, and avoid sending
IPIs to idle CPUs, including those running guest OSes, and, on NOHZ_FULL
kernels, nohz_full CPUs. However, the kernel has been observed checking
a CPU while it was non-idle, but sending the IPI after it has gone
idle. This commit therefore rechecks idle state immediately before
sending the IPI, refraining from IPIing CPUs that have since gone idle.
Reported-by: Rik van Riel <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
Although rcutorture will occasionally do a 50-millisecond grace-period
delay, these delays are quite rare. And rightly so, because otherwise
the read rate would be quite low. Thie means that it can be important
to identify whether or not a given run contained a long-delay read.
This commit therefore inserts a trace_rcu_torture_read() event to flag
runs containing long delays.
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
The __call_rcu() comment about opportunistically noting grace period
beginnings and endings is obsolete. RCU still does such opportunistic
noting, but in __call_rcu_core() rather than __call_rcu(), and there
already is an appropriate comment in __call_rcu_core(). This commit
therefore removes the obsolete comment.
Reported-by: Michalis Kokologiannakis <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Reviewed-by: Josh Triplett <[email protected]>
|
|
In the deep past, rcu_check_callbacks() was only invoked if rcu_pending()
returned true. Which was fine, but these days rcu_check_callbacks()
is invoked unconditionally. This commit therefore removes the obsolete
sentence from the header comment.
Reported-by: Michalis Kokologiannakis <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Reviewed-by: Josh Triplett <[email protected]>
|
|
Commit 720abae3d68ae ("rcu: force alignment on struct
callback_head/rcu_head") forced the rcu_head (AKA callback_head)
structure's alignment to pointer size, that is, to 4-byte boundaries on
32-bit systems and to 8-byte boundaries on 64-bit systems. This
commit therefore checks for this same alignment in __call_rcu(),
which used to contain a looser check for two-byte alignment.
Signed-off-by: Paul E. McKenney <[email protected]>
Tested-by: Geert Uytterhoeven <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Josh Triplett <[email protected]>
|
|
This reverts commit bfd8d3f23b51018388be0411ccbc2d56277fe294.
It turns out that this flushes things much too aggressiverly, and causes
lines to break up when the system logger races with new continuation
lines being printed.
There's a pending patch to make printk() flushing much more
straightforward, but it's too invasive for 4.9, so in the meantime let's
just not make the system message logging flush continuation lines.
They'll be flushed by the final newline anyway.
Suggested-by: Petr Mladek <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Ingo Molnar:
"This fixes a genirq regression that resulted in the Intel/Broxton
pinctrl/GPIO driver (and possibly others) spewing warnings"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Use irq type from irqdata instead of irqdesc
|
|
|
|
Commit 67f8b1dcb9ee ("net/mlx4_en: Refactor the XDP forwarding rings
scheme") added a bug in that the prog's reference count is not dropped
in the error path when mlx4_en_try_alloc_resources() is failing from
mlx4_xdp_set().
We previously took bpf_prog_add(prog, priv->rx_ring_num - 1), that we
need to release again. Earlier in the call path, dev_change_xdp_fd()
itself holds a reference to the prog as well (hence the '- 1' in the
bpf_prog_add()), so a simple atomic_sub() is safe to use here. When
an error is propagated, then bpf_prog_put() is called eventually from
dev_change_xdp_fd()
Fixes: 67f8b1dcb9ee ("net/mlx4_en: Refactor the XDP forwarding rings scheme")
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix two bugs in error code paths in the PM core (system-wide
suspend of devices), a device reference leak in the boot-time suspend
test code and a cpupower utility regression from the 4.7 cycle.
Specifics:
- Prevent the PM core from attempting to suspend parent devices if
any of their children, whose suspend callbacks were invoked
asynchronously, have failed to suspend during the "late" and
"noirq" phases of system-wide suspend of devices (Brian Norris).
- Prevent the boot-time system suspend test code from leaking a
reference to the RTC device used by it (Johan Hovold).
- Fix cpupower to use the return value of one of its library
functions correctly and restore the correct behavior of it when
used for setting cpufreq tunables broken during the 4.7 development
cycle (Laura Abbott)"
* tag 'pm-4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM / sleep: don't suspend parent when async child suspend_{noirq, late} fails
PM / sleep: fix device reference leak in test_suspend
cpupower: Correct return type of cpu_power_is_cpu_online() in cpufreq-set
|
|
* pm-tools-fixes:
cpupower: Correct return type of cpu_power_is_cpu_online() in cpufreq-set
* pm-sleep-fixes:
PM / sleep: don't suspend parent when async child suspend_{noirq, late} fails
PM / sleep: fix device reference leak in test_suspend
|
|
This reverts commit 05fd007e4629 ("console: don't prefer first
registered if DT specifies stdout-path").
The reverted commit changes existing behavior on which many ARM boards
rely. Many ARM small-board-computers, like e.g. the Raspberry Pi have
both a video output and a serial console. Depending on whether the user
is using the device as a more regular computer; or as a headless device
we need to have the console on either one or the other.
Many users rely on the kernel behavior of the console being present on
both outputs, before the reverted commit the console setup with no
console= kernel arguments on an ARM board which sets stdout-path in dt
would look like this:
[root@localhost ~]# cat /proc/consoles
ttyS0 -W- (EC p a) 4:64
tty0 -WU (E p ) 4:1
Where as after the reverted commit, it looks like this:
[root@localhost ~]# cat /proc/consoles
ttyS0 -W- (EC p a) 4:64
This commit reverts commit 05fd007e4629 ("console: don't prefer first
registered if DT specifies stdout-path") restoring the original
behavior.
Fixes: 05fd007e4629 ("console: don't prefer first registered if DT specifies stdout-path")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Hans de Goede <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Rob Herring <[email protected]>
Cc: Frank Rowand <[email protected]>
Cc: Thorsten Leemhuis <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In the comment:
/*
* The task might have changed its scheduling policy to something
* different than SCHED_DEADLINE (through switched_fromd_dl()).
*/
s/fromd/from/
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Luca Abeni <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/5408b3b3f9ee197a7b7f10fb834341100a4f2c88.1478599881.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The 'class' parameter is not used, remove it.
n
Signed-off-by: Tahsin Erdogan <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Remove the unused but set variables min_set and max_set in
adjust_reg_min_max_vals to fix the following warning when building with
'W=1':
kernel/bpf/verifier.c:1483:7: warning: variable ‘min_set’ set but not used [-Wunused-but-set-variable]
There is no warning about max_set being unused, but since it is only
used in the assignment of min_set it can be removed as well.
They were introduced in commit 484611357c19 ("bpf: allow access into map
value arrays") but seem to have never been used.
Cc: Josef Bacik <[email protected]>
Signed-off-by: Tobias Klauser <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Install the callbacks via the state machine.
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
Only calculate the affinity for the main I/O vectors, and skip the
pre or post vectors specified by struct irq_affinity.
Also remove the irq_affinity cpumask argument that has never been used.
If we ever need it in the future we can pass it through struct
irq_affinity.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Acked-by: Bjorn Helgaas <[email protected]>
Acked-by: Jens Axboe <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
Only calculate the affinity for the main I/O vectors, and skip the pre or
post vectors specified by struct irq_affinity.
Also remove the irq_affinity cpumask argument that has never been used. If
we ever need it in the future we can pass it through struct irq_affinity.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Acked-by: Jens Axboe <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
The type flags in the irq descriptor are there for historical reasons and
only updated via irq_modify_status() or irq_set_type(). Both functions also
update the type flags in irqdata. __setup_irq() is the only left over user
of the type flags in the irq descriptor.
If __setup_irq() is called with empty irq type flags, then the type flags
are retrieved from irqdata. If an interrupt is shared, then the type flags
are compared with the type flags stored in the irq descriptor.
On x86 the ioapic does not have a irq_set_type() callback because the type
is defined in the BIOS tables and cannot be changed. The type is stored in
irqdata at setup time without updating the type data in the irq
descriptor. As a result the comparison described above fails.
There is no point in updating the irq descriptor flags because the only
relevant storage is irqdata. Use the type flags from irqdata for both
retrieval and comparison in __setup_irq() instead.
Aside of that the print out in case of non matching type flags has the old
and new type flags arguments flipped. Fix that as well.
For correctness sake the flags stored in the irq descriptor should be
removed, but this is beyond the scope of this bugfix and will be done in a
later patch.
Fixes: 4b357daed698 ("genirq: Look-up trigger type if not specified by caller")
Reported-and-tested-by: Mika Westerberg <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Jon Hunter <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1611072020360.3501@nanos
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
In map_create(), we first find and create the map, then once that
suceeded, we charge it to the user's RLIMIT_MEMLOCK, and then fetch
a new anon fd through anon_inode_getfd(). The problem is, once the
latter fails f.e. due to RLIMIT_NOFILE limit, then we only destruct
the map via map->ops->map_free(), but without uncharging the previously
locked memory first. That means that the user_struct allocation is
leaked as well as the accounted RLIMIT_MEMLOCK memory not released.
Make the label names in the fix consistent with bpf_prog_load().
Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs")
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Commit a6ed3ea65d98 ("bpf: restore behavior of bpf_map_update_elem")
added an extra per-cpu reserve to the hash table map to restore old
behaviour from pre prealloc times. When non-prealloc is in use for a
map, then problem is that once a hash table extra element has been
linked into the hash-table, and the hash table is destroyed due to
refcount dropping to zero, then htab_map_free() -> delete_all_elements()
will walk the whole hash table and drop all elements via htab_elem_free().
The problem is that the element from the extra reserve is first fed
to the wrong backend allocator and eventually freed twice.
Fixes: a6ed3ea65d98 ("bpf: restore behavior of bpf_map_update_elem")
Reported-by: Dmitry Vyukov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull stack vmap fixups from Thomas Gleixner:
"Two small patches related to sched_show_task():
- make sure to hold a reference on the task stack while accessing it
- remove the thread_saved_pc printout
.. and add a sanity check into release_task_stack() to catch problems
with task stack references"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Remove pointless printout in sched_show_task()
sched/core: Fix oops in sched_show_task()
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
fork: Add task stack refcounting sanity check and prevent premature task stack freeing
|
|
cgroupstats_cmd_get_policy is [CGROUPSTATS_CMD_ATTR_MAX+1],
taskstats_cmd_get_policy[TASKSTATS_CMD_ATTR_MAX+1],
but their family.maxattr is TASKSTATS_CMD_ATTR_MAX.
CGROUPSTATS_CMD_ATTR_MAX is less than TASKSTATS_CMD_ATTR_MAX,
so we could end up accessing out-of-bound.
Change cgroupstats_cmd_get_policy to TASKSTATS_CMD_ATTR_MAX+1,
this is safe because the rest are initialized to 0's.
Reported-by: Andrey Konovalov <[email protected]>
Tested-by: Andrey Konovalov <[email protected]>
Signed-off-by: Cong Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
In sched_show_task() we print out a useless hex number, not even a
symbol, and there's a big question mark whether this even makes sense
anyway, I suspect we should just remove it all.
Signed-off-by: Linus Torvalds <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/CA+55aFzphURPFzAvU4z6Moy7ZmimcwPuUdYU8bj9z0J+S8X1rw@mail.gmail.com
Signed-off-by: Ingo Molnar <[email protected]>
|
|
When CONFIG_THREAD_INFO_IN_TASK=y, it is possible that an exited thread
remains in the task list after its stack pointer was already set to NULL.
Therefore, thread_saved_pc() and stack_not_used() in sched_show_task()
will trigger NULL pointer dereference if an attempt to dump such thread's
traces (e.g. SysRq-t, khungtaskd) is made.
Since show_stack() in sched_show_task() calls try_get_task_stack() and
sched_show_task() is called from interrupt context, calling
try_get_task_stack() from sched_show_task() will be safe as well.
Signed-off-by: Tetsuo Handa <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Make sure to drop the reference taken by class_find_device() after
opening the RTC device.
Fixes: 77437fd4e61f (pm: boot time suspend selftest)
Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
|
|
Fix struct seccomp_filter and seccomp_run_filters() signatures.
Signed-off-by: Mickaël Salaün <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: James Morris <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Will Drewry <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
|
|
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly. Where applicable this also drops usage of the
bio_set_op_attrs wrapper.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|