path: root/kernel
Age | Commit message | Author | Files | Lines
2018-10-17locking/pvqspinlock: Extend node size when pvqspinlock is configuredWaiman Long2-11/+27
The qspinlock code supports up to 4 levels of slowpath nesting using four per-CPU mcs_spinlock structures. For 64-bit architectures, they fit nicely in one 64-byte cacheline. The para-virtualized (PV) qspinlock code, however, needs to store more information in the per-CPU node structure than there is space for, so it uses a trick to hold the extra information in a second cacheline. As a result, PV qspinlock needs to access two extra cachelines for its information whereas the native qspinlock code only needs one extra cacheline. Freshly added counter profiling of the qspinlock code, however, revealed that it was very rare to use more than two levels of slowpath nesting. So it doesn't make sense to penalize the PV qspinlock code in order to pack four mcs_spinlock structures into the same cacheline, optimizing for a case in the native qspinlock code that rarely happens. Extend the per-CPU node structure by two more long words when pvqspinlock is configured, to hold the extra data it needs. As a result, the PV qspinlock code will enjoy the same benefit of using just one extra cacheline as the native counterpart, for most cases. [ mingo: Minor changelog edits. ] Signed-off-by: Waiman Long <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
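The resulting layout is roughly as follows (a sketch based on the changelog; the exact structure and array names may differ from the patch):

	struct qnode {
		struct mcs_spinlock mcs;
	#ifdef CONFIG_PARAVIRT_SPINLOCKS
		long reserved[2];	/* PV-only data, kept in one extra cacheline */
	#endif
	};

	/* one node per nesting level per CPU: user, softirq, hardirq, NMI */
	static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[4]);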
2018-10-17locking/qspinlock_stat: Count instances of nested lock slowpathsWaiman Long2-0/+11
The queued spinlock code supports up to 4 levels of lock slowpath nesting - user context, soft IRQ, hard IRQ and NMI. However, we are not sure how often the nesting happens. So add 3 more per-CPU stat counters to track the number of instances where the nesting index goes to 1, 2 and 3 respectively. On a dual-socket 64-core 128-thread Zen server, the new stat counter values under different circumstances were:

  State                            slowpath      index1   index2   index3
  -----                            --------      ------   ------   ------
  After bootup                     1,012,150         82        0        0
  After parallel build + perf-top  125,195,009       82        0        0

So the chance of having more than 2 levels of nesting is extremely low. [ mingo: Minor changelog edits. ] Signed-off-by: Waiman Long <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
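The counting hook in the slowpath is presumably of this shape (the qstat_lock_idx* event names are inferred from the table headings above, not copied from the patch):

	idx = node->count++;
	/* idx > 0 means we nested into another slowpath level */
	if (idx)
		qstat_inc(qstat_lock_idx1 + idx - 1, true);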
2018-10-16locking/qspinlock, x86: Provide liveness guaranteePeter Zijlstra1-1/+15
On x86 we cannot do fetch_or() with a single instruction and thus end up using a cmpxchg loop, which reduces determinism. Replace the fetch_or() with a composite operation: tas-pending + load. Using two instructions of course opens a window we previously did not have. Consider the scenario:

	CPU0		CPU1		CPU2

 1)	lock
	  trylock -> (0,0,1)

 2)			lock
			  trylock /* fail */

 3)	unlock -> (0,0,0)

 4)					lock
					  trylock -> (0,0,1)

 5)			tas-pending -> (0,1,1)
			load-val <- (0,1,0) from 3

 6)			clear-pending-set-locked -> (0,0,1)

			FAIL: _2_ owners

where 5) is our new composite operation. When we consider each part of the qspinlock state as a separate variable (as we can when _Q_PENDING_BITS == 8) then the above is entirely possible, because tas-pending will only RmW the pending byte, so the later load is able to observe prior tail and lock state (but not earlier than its own trylock, which operates on the whole word, due to coherence). To avoid this we need 2 things:

 - the load must come after the tas-pending (obviously, otherwise it can trivially observe prior state).

 - the tas-pending must be a full word RmW instruction, it cannot be an XCHGB for example, such that we cannot observe other state prior to setting pending.

On x86 we can realize this by using "LOCK BTS m32, r32" for tas-pending followed by a regular load. Note that observing later state is not a problem:

 - if we fail to observe a later unlock, we'll simply spin-wait for that store to become visible.

 - if we observe a later xchg_tail(), there is no difference from that xchg_tail() having taken place before the tas-pending.

Suggested-by: Will Deacon <[email protected]> Reported-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Will Deacon <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: [email protected] Fixes: 59fb586b4a07 ("locking/qspinlock: Remove unbounded cmpxchg() loop from locking slowpath") Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
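A standalone sketch of the composite operation; the kernel implements it with its RMWcc helpers, but the instruction pattern is the point here (bit 8 as the pending bit follows the qspinlock layout):

	static inline unsigned int fetch_set_pending_acquire(unsigned int *lock)
	{
		unsigned char old;

		/* tas-pending: full-word RmW, so no earlier tail/lock state can leak past it */
		asm volatile("lock btsl %2, %0"
			     : "+m" (*lock), "=@ccc" (old)
			     : "Ir" (8)
			     : "memory");

		/* regular load; ordered after the BTS by x86 TSO */
		return ((unsigned int)old << 8) |
		       (*(volatile unsigned int *)lock & ~(1u << 8));
	}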
2018-10-16locking/qspinlock: Rework some commentsPeter Zijlstra1-10/+26
While working my way through the code again, I felt the comments could use some help. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Will Deacon <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-10-16locking/qspinlock: Re-order codePeter Zijlstra1-29/+27
Flip the branch condition after atomic_fetch_or_acquire(_Q_PENDING_VAL) such that we lose the indent. This also results in a more natural code flow IMO. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Will Deacon <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-10-16Merge branch 'x86/build' into locking/core, to pick up dependent patches and unify jump-label workIngo Molnar17-120/+180
Signed-off-by: Ingo Molnar <[email protected]>
2018-10-16locking/lockdep: Remove duplicated 'lock_class_ops' percpu arrayWaiman Long1-1/+0
Remove the duplicated 'lock_class_ops' percpu array that is not used anywhere. Signed-off-by: Waiman Long <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Fixes: 8ca2b56cd7da ("locking/lockdep: Make class->ops a percpu counter and move it under CONFIG_DEBUG_LOCKDEP=y") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-10-09futex: Replace spin_is_locked() with lockdepLance Roy1-2/+2
lockdep_assert_held() is better suited for checking locking requirements, since it won't get confused when the lock is held by some other task. This is also a step towards possibly removing spin_is_locked(). Signed-off-by: Lance Roy <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Darren Hart <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2018-10-09locking/lockdep: Make class->ops a percpu counter and move it under CONFIG_DEBUG_LOCKDEP=yWaiman Long3-4/+36
A sizable portion of the CPU cycles spent in __lock_acquire() is used up by the atomic increment of the class->ops stat counter. By taking it out of the lock_class structure and changing it to a per-cpu per-lock-class counter, we can reduce the amount of cacheline contention on the class structure when multiple CPUs are trying to acquire locks of the same class simultaneously. To limit the increase in memory consumption because of the percpu nature of that counter, it is now put back under the CONFIG_DEBUG_LOCKDEP config option, so the memory consumption increase only occurs if CONFIG_DEBUG_LOCKDEP is defined. The lock_class structure, however, is reduced in size by 16 bytes on 64-bit archs after the ops removal and a minor restructuring of the fields. This patch also fixes a bug in the increment code, as the counter is of the 'unsigned long' type but atomic_inc() was used to increment it. Signed-off-by: Waiman Long <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
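The counter presumably ends up as something like the following (a sketch; the helper name is an assumption):

	#ifdef CONFIG_DEBUG_LOCKDEP
	DEFINE_PER_CPU(unsigned long [MAX_LOCKDEP_KEYS], lock_class_ops);

	static inline void debug_class_ops_inc(struct lock_class *class)
	{
		/* percpu increment: no cacheline bouncing on the shared class */
		this_cpu_inc(lock_class_ops[class - lock_classes]);
	}
	#else
	static inline void debug_class_ops_inc(struct lock_class *class) { }
	#endif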
2018-10-06Merge branch 'core/core' into x86/build, to prevent conflictsIngo Molnar2-55/+54
Signed-off-by: Ingo Molnar <[email protected]>
2018-10-03locking/lockdep: Add a faster path in __lock_release()Waiman Long1-3/+14
When __lock_release() is called, the most likely unlock scenario is on the innermost lock in the chain. In this case, we can skip some of the checks and provide a faster path to completion. Signed-off-by: Waiman Long <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
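Conceptually, the fast path is (a sketch; the reference-count handling is simplified):

	struct held_lock *hlock = curr->held_locks + depth - 1;

	if (hlock->instance == lock && !hlock->references) {
		/* innermost lock: pop it directly, skip the full scan */
		curr->lockdep_depth--;
		return 1;
	}
	/* otherwise fall through to the general release path */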
2018-10-03locking/lockdep: Eliminate redundant IRQs check in __lock_acquire()Waiman Long1-8/+7
The static __lock_acquire() function has only two callers: 1) lock_acquire() 2) reacquire_held_locks() In lock_acquire(), raw_local_irq_save() is called beforehand, so IRQs must already be disabled, which makes the check DEBUG_LOCKS_WARN_ON(!irqs_disabled()) redundant in this case. Move the check to reacquire_held_locks() to eliminate the redundant code from the lock_acquire() path. Signed-off-by: Waiman Long <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-10-03locking/lockdep: Remove add_chain_cache_classes()Waiman Long1-70/+0
The inline function add_chain_cache_classes() is defined, but has no caller. Just remove it. Signed-off-by: Waiman Long <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-10-02jump_label: Fix NULL dereference bug in __jump_label_mod_update()Ard Biesheuvel1-1/+1
Commit 19483677684b ("jump_label: Annotate entries that operate on __init code earlier") refactored the code that manages runtime patching of jump labels in modules that are tied to static keys defined in other modules or in the core kernel. In the latter case, we may iterate over the static_key_mod linked list until we hit the entry for the core kernel, whose 'mod' field will be NULL, and attempt to dereference it to get at its 'state' member. So let's add a non-NULL check: this forces the 'init' argument of __jump_label_update() to false for static keys that are defined in the core kernel, which is appropriate given that __init annotated jump_label entries in the core kernel should no longer be active at this point (i.e., when loading modules). Fixes: 19483677684b ("jump_label: Annotate entries that operate on ...") Reported-by: Dan Carpenter <[email protected]> Signed-off-by: Ard Biesheuvel <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Kees Cook <[email protected]> Cc: Jessica Yu <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
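Given the 1-line diffstat, the fix is presumably along these lines (sketched from the description; a NULL 'mod' is the core kernel, which can never be in the COMING state):

	__jump_label_update(key, mod->entries, stop,
			    mod->mod && mod->mod->state == MODULE_STATE_COMING);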
2018-09-29Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipGreg Kroah-Hartman1-0/+6
Thomas writes: "A single fix for a missing sanity check when a pinned event is tried to be read on the wrong CPU due to a legit event scheduling failure."

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Add sanity check to deal with pinned event failure
2018-09-28perf/core: Add sanity check to deal with pinned event failureReinette Chatre1-0/+6
It is possible that a failure can occur during the scheduling of a pinned event. The initial portion of perf_event_read_local() contains the various error checks an event should pass before it can be considered valid. Ensure that the potential scheduling failure of a pinned event is checked for, and that a credible error is returned. Suggested-by: Peter Zijlstra <[email protected]> Signed-off-by: Reinette Chatre <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: https://lkml.kernel.org/r/6486385d1f30336e9973b24c8c65f5079543d3d3.1537377064.git.reinette.chatre@intel.com
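The added check in perf_event_read_local() is presumably of this shape (a sketch inferred from the description):

	/* If this is a pinned event it must be running on this CPU */
	if (event->attr.pinned && event->oncpu != smp_processor_id()) {
		ret = -EBUSY;
		goto out;
	}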
2018-09-27jump_table: Move entries into ro_after_init regionArd Biesheuvel1-0/+9
The __jump_table sections emitted into the core kernel and into each module consist of statically initialized references into other parts of the code, and with the exception of entries that point into init code, which are defused at post-init time, these data structures are never modified. So let's move them into the ro_after_init section, to prevent them from being corrupted inadvertently by buggy code, or deliberately by an attacker. Signed-off-by: Ard Biesheuvel <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Kees Cook <[email protected]> Acked-by: Jessica Yu <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: [email protected] Cc: [email protected] Cc: Arnd Bergmann <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Will Deacon <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Martin Schwidefsky <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2018-09-27jump_label: Annotate entries that operate on __init code earlierArd Biesheuvel1-34/+14
Jump table entries are mostly read-only, with the exception of the init and module loader code that defuses entries that point into init code when the code being referred to is freed. For robustness, it would be better to move these entries into the ro_after_init section, but clearing the 'code' member of each jump table entry referring to init code at module load time races with the module_enable_ro() call that remaps the ro_after_init section read only, so we'd like to do it earlier. So given that whether such an entry refers to init code can be decided much earlier, we can pull this check forward. Since we may still need the code entry at this point, let's switch to setting a low bit in the 'key' member just like we do to annotate the default state of a jump table entry. Signed-off-by: Ard Biesheuvel <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Kees Cook <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: [email protected] Cc: [email protected] Cc: Arnd Bergmann <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Will Deacon <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Jessica Yu <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2018-09-27jump_label: Implement generic support for relative referencesArd Biesheuvel1-1/+21
To reduce the size taken up by absolute references in jump label entries themselves and the associated relocation records in the .init segment, add support for emitting them as relative references instead. Note that this requires some extra care in the sorting routine, given that the offsets change when entries are moved around in the jump_entry table. Signed-off-by: Ard Biesheuvel <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: [email protected] Cc: [email protected] Cc: Arnd Bergmann <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Kees Cook <[email protected]> Cc: Will Deacon <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Jessica Yu <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
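With relative references, each entry shrinks to three 32-bit offsets and each accessor rebases against the field's own address. Roughly (a sketch of the scheme):

	struct jump_entry {
		s32 code;	/* offset from &entry->code to the patch site   */
		s32 target;	/* offset from &entry->target to the jump label */
		s32 key;	/* offset from &entry->key to the static key    */
	};

	static inline unsigned long jump_entry_code(const struct jump_entry *entry)
	{
		return (unsigned long)&entry->code + entry->code;
	}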
2018-09-27jump_label: Abstract jump_entry member accessorsArd Biesheuvel1-25/+15
In preparation of allowing architectures to use relative references in jump_label entries [which can dramatically reduce the memory footprint], introduce abstractions for references to the 'code' and 'key' members of struct jump_entry. Signed-off-by: Ard Biesheuvel <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: [email protected] Cc: [email protected] Cc: Arnd Bergmann <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Kees Cook <[email protected]> Cc: Will Deacon <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Jessica Yu <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
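For the existing absolute layout the accessors are trivial wrappers, e.g. (a sketch; the key accessor masks off the low bit used to encode the branch default):

	static inline unsigned long jump_entry_code(const struct jump_entry *entry)
	{
		return entry->code;
	}

	static inline struct static_key *jump_entry_key(const struct jump_entry *entry)
	{
		return (struct static_key *)((unsigned long)entry->key & ~1UL);
	}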
2018-09-25dma-mapping: add the missing ARCH_HAS_SYNC_DMA_FOR_CPU_ALL declarationChristoph Hellwig1-0/+3
The patch adding the infrastructure failed to actually add the symbol declaration, oops. Fixes: faef87723a ("dma-noncoherent: add a arch_sync_dma_for_cpu_all hook") Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Paul Burton <[email protected]> Acked-by: Florian Fainelli <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
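Given the 3-line diffstat under kernel/, the missing piece is presumably just the Kconfig symbol (an assumption based on the subject):

	config ARCH_HAS_SYNC_DMA_FOR_CPU_ALL
		bool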
2018-09-25Merge gitolite.kernel.org:/pub/scm/linux/kernel/git/davem/netGreg Kroah-Hartman1-20/+71
Dave writes: "Networking fixes: 1) Fix multiqueue handling of coalesce timer in stmmac, from Jose Abreu. 2) Fix memory corruption in NFC, from Suren Baghdasaryan. 3) Don't write reserved bits in ravb driver, from Kazuya Mizuguchi. 4) SMC bug fixes from Karsten Graul, YueHaibing, and Ursula Braun. 5) Fix TX done race in mvpp2, from Antoine Tenart. 6) ipv6 metrics leak, from Wei Wang. 7) Adjust firmware version requirements in mlxsw, from Petr Machata. 8) Fix autonegotiation on resume in r8169, from Heiner Kallweit. 9) Fixed missing entries when dumping /proc/net/if_inet6, from Jeff Barnhill. 10) Fix double free in devlink, from Dan Carpenter. 11) Fix ethtool regression from UFO feature removal, from Maciej Żenczykowski. 12) Fix drivers that have a ndo_poll_controller() that captures the cpu entirely on loaded hosts by trying to drain all rx and tx queues, from Eric Dumazet. 13) Fix memory corruption with jumbo frames in aquantia driver, from Friedemann Gerold." * gitolite.kernel.org:/pub/scm/linux/kernel/git/davem/net: (79 commits) net: mvneta: fix the remaining Rx descriptor unmapping issues ip_tunnel: be careful when accessing the inner header mpls: allow routes on ip6gre devices net: aquantia: memory corruption on jumbo frames tun: remove ndo_poll_controller nfp: remove ndo_poll_controller bnxt: remove ndo_poll_controller bnx2x: remove ndo_poll_controller mlx5: remove ndo_poll_controller mlx4: remove ndo_poll_controller i40evf: remove ndo_poll_controller ice: remove ndo_poll_controller igb: remove ndo_poll_controller ixgb: remove ndo_poll_controller fm10k: remove ndo_poll_controller ixgbevf: remove ndo_poll_controller ixgbe: remove ndo_poll_controller bonding: use netpoll_poll_dev() helper netpoll: make ndo_poll_controller() optional rds: Fix build regression. ...
2018-09-22bpf: sockmap, fix transition through disconnect without closeJohn Fastabend1-19/+41
It is possible (via shutdown()) for TCP socks to go through the TCP_CLOSE state via tcp_disconnect() without actually calling tcp_close(), which would then call our bpf_tcp_close() callback. Because of this a user could disconnect a socket and then put it in a LISTEN state, which would break our assumption that sockets are always in the ESTABLISHED state. To resolve this, rely on the unhash hook, which is called in the disconnect case, to remove the sock from the sockmap. Reported-by: Eric Dumazet <[email protected]> Fixes: 1aa12bdf1bfb ("bpf: sockmap, add sock close() hook to remove socks") Signed-off-by: John Fastabend <[email protected]> Acked-by: Yonghong Song <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-09-22bpf: sockmap only allow ESTABLISHED sock stateJohn Fastabend1-1/+30
After this patch we only allow socks that are in ESTABLISHED state or are being added via a sock_ops event that is transitioning into an ESTABLISHED state. By allowing sock_ops events we allow users to manage sockmaps directly from sock ops programs. The two supported sock_ops ops are BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB and BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB. Similar to TLS ULP this ensures sk_user_data is correct. Reported-by: Eric Dumazet <[email protected]> Fixes: 1aa12bdf1bfb ("bpf: sockmap, add sock close() hook to remove socks") Signed-off-by: John Fastabend <[email protected]> Acked-by: Yonghong Song <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-09-20kernel/sys.c: remove duplicated includeYueHaibing1-3/+0
Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: YueHaibing <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2018-09-20fork: report pid exhaustion correctlyKJ Tsanaktsidis1-1/+1
Make the clone and fork syscalls return EAGAIN when the limit on the number of pids /proc/sys/kernel/pid_max is exceeded. Currently, when the pid_max limit is exceeded, the kernel will return ENOSPC from the fork and clone syscalls. This is contrary to the documented behaviour, which explicitly calls out the pid_max case as one where EAGAIN should be returned. It also leads to really confusing error messages in userspace programs which will complain about a lack of disk space when they fail to create processes/threads for this reason. This error is being returned because alloc_pid() uses the idr api to find a new pid; when there are none available, idr_alloc_cyclic() returns -ENOSPC, and this is being propagated back to userspace. This behaviour has been broken before, and was explicitly fixed in commit 35f71bc0a09a ("fork: report pid reservation failure properly"), so I think -EAGAIN is definitely the right thing to return in this case. The current behaviour change dates from commit 95846ecf9dac ("pid: replace pid bitmap implementation with IDR API") and was I believe unintentional. This patch has no impact on the case where allocating a pid fails because the child reaper for the namespace is dead; that case will still return -ENOMEM. Link: http://lkml.kernel.org/r/[email protected] Fixes: 95846ecf9dac ("pid: replace pid bitmap implementation with IDR API") Signed-off-by: KJ Tsanaktsidis <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Gargi Sharma <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
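The fix boils down to mapping the idr error before it escapes alloc_pid(), roughly (a sketch; surrounding code elided):

	nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min, pid_max, GFP_ATOMIC);
	if (nr < 0) {
		/* translate the idr's -ENOSPC into the documented -EAGAIN */
		retval = (nr == -ENOSPC) ? -EAGAIN : nr;
		goto out_free;
	}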
2018-09-19Merge tag 'trace-v4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-traceGreg Kroah-Hartman1-0/+2
Steven writes: "Vaibhav Nagarnaik found that modifying the ring buffer size could cause a huge latency in the system because it does a while loop to free pages without releasing the CPU (on non preempt kernels). In a case where there are hundreds of thousands of pages to free it could actually cause a system stall. A properly placed cond_resched() solves this issue."
2018-09-18Merge gitolite.kernel.org:/pub/scm/linux/kernel/git/davem/netGreg Kroah-Hartman2-2/+2
Dave writes: "Various fixes, all over the place: 1) OOB data generation fix in bluetooth, from Matias Karhumaa. 2) BPF BTF boundary calculation fix, from Martin KaFai Lau. 3) Don't bug on excessive frags, to be compatible in situations mixing older and newer kernels on each end. From Juergen Gross. 4) Scheduling in RCU fix in hv_netvsc, from Stephen Hemminger. 5) Zero keying information in TLS layer before freeing copies of them, from Sabrina Dubroca. 6) Fix NULL deref in act_sample, from Davide Caratti. 7) Orphan SKB before GRO in veth to prevent crashes with XDP, from Toshiaki Makita. 8) Fix use after free in ip6_xmit, from Eric Dumazet. 9) Fix VF mac address regression in bnxt_en, from Micahel Chan. 10) Fix MSG_PEEK behavior in TLS layer, from Daniel Borkmann. 11) Programming adjustments to r8169 which fix not being to enter deep sleep states on some machines, from Kai-Heng Feng and Hans de Goede. 12) Fix DST_NOCOUNT flag handling for ipv6 routes, from Peter Oskolkov." * gitolite.kernel.org:/pub/scm/linux/kernel/git/davem/net: (45 commits) net/ipv6: do not copy dst flags on rt init qmi_wwan: set DTR for modems in forced USB2 mode clk: x86: Stop marking clocks as CLK_IS_CRITICAL r8169: Get and enable optional ether_clk clock clk: x86: add "ether_clk" alias for Bay Trail / Cherry Trail r8169: enable ASPM on RTL8106E r8169: Align ASPM/CLKREQ setting function with vendor driver Revert "kcm: remove any offset before parsing messages" kcm: remove any offset before parsing messages net: ethernet: Fix a unused function warning. net: dsa: mv88e6xxx: Fix ATU Miss Violation tls: fix currently broken MSG_PEEK behavior hv_netvsc: pair VF based on serial number PCI: hv: support reporting serial number as slot information bnxt_en: Fix VF mac address regression. ipv6: fix possible use-after-free in ip6_xmit() net: hp100: fix always-true check for link up state ARM: dts: at91: add new compatibility string for macb on sama5d3 net: macb: disable scatter-gather for macb on sama5d3 net: mvpp2: let phylink manage the carrier state ...
2018-09-17ring-buffer: Allow for rescheduling when removing pagesVaibhav Nagarnaik1-0/+2
When reducing the ring buffer size, pages are removed by scheduling a work item on each CPU for the corresponding CPU ring buffer. After the pages are removed from the ring buffer linked list, the pages are free()d in a tight loop. The loop does not give up the CPU until all pages are removed. In the worst case, when a lot of pages are to be freed, it can cause a system stall. After the pages are removed from the list, the free() can happen while the work is rescheduled. Call cond_resched() in the loop to prevent the system hangup. Link: http://lkml.kernel.org/r/[email protected] Cc: [email protected] Fixes: 83f40318dab00 ("ring-buffer: Make removal of ring buffer pages atomic") Reported-by: Jason Behmer <[email protected]> Signed-off-by: Vaibhav Nagarnaik <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
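The fix is a reschedule point inside the page-freeing loop of the removal work, roughly (a sketch; bookkeeping elided):

	do {
		cond_resched();	/* the added reschedule point */

		to_remove_page = tmp_iter_page;
		rb_inc_page(cpu_buffer, &tmp_iter_page);
		/* ... unlink the page and account removed events ... */
		free_buffer_page(to_remove_page);
		nr_removed--;
	} while (to_remove_page != last_page);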
2018-09-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller2-2/+2
Daniel Borkmann says: ==================== pull-request: bpf 2018-09-16 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Fix end boundary calculation in BTF for the type section, from Martin. 2) Fix and revert subtraction of pointers that was accidentally allowed for unprivileged programs, from Alexei. 3) Fix bpf_msg_pull_data() helper by using __GFP_COMP in order to avoid a warning in linearizing sg pages into a single one for large allocs, from Tushar. ==================== Signed-off-by: David S. Miller <[email protected]>
2018-09-15Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds3-15/+22
Pull scheduler fixes from Ingo Molnar: "Misc fixes: various scheduler metrics corner case fixes, a sched_features deadlock fix, and a topology fix for certain NUMA systems"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix kernel-doc notation warning
  sched/fair: Fix load_balance redo for !imbalance
  sched/fair: Fix scale_rt_capacity() for SMT
  sched/fair: Fix vruntime_normalized() for remote non-migration wakeup
  sched/pelt: Fix update_blocked_averages() for RT and DL classes
  sched/topology: Set correct NUMA topology type
  sched/debug: Fix potential deadlock when writing to sched_features
2018-09-15Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds2-15/+13
Pull perf fixes from Ingo Molnar: "Mostly tooling fixes, but also breakpoint and x86 PMU driver fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  perf tools: Fix maps__find_symbol_by_name()
  tools headers uapi: Update tools's copy of linux/if_link.h
  tools headers uapi: Update tools's copy of linux/vhost.h
  tools headers uapi: Update tools's copies of kvm headers
  tools headers uapi: Update tools's copy of drm/drm.h
  tools headers uapi: Update tools's copy of asm-generic/unistd.h
  tools headers uapi: Update tools's copy of linux/perf_event.h
  perf/core: Force USER_DS when recording user stack data
  perf/UAPI: Clearly mark __PERF_SAMPLE_CALLCHAIN_EARLY as internal use
  perf/x86/intel: Add support/quirk for the MISPREDICT bit on Knights Landing CPUs
  perf annotate: Fix parsing aarch64 branch instructions after objdump update
  perf probe powerpc: Ignore SyS symbols irrespective of endianness
  perf event-parse: Use fixed size string for comms
  perf util: Fix bad memory access in trace info.
  perf tools: Streamline bpf examples and headers installation
  perf evsel: Fix potential null pointer dereference in perf_evsel__new_idx()
  perf arm64: Fix include path for asm-generic/unistd.h
  perf/hw_breakpoint: Simplify breakpoint enable in perf_event_modify_breakpoint
  perf/hw_breakpoint: Enable breakpoint in modify_user_hw_breakpoint
  perf/hw_breakpoint: Remove superfluous bp->attr.disabled = 0
  ...
2018-09-15Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds4-5/+3
Pull locking fixes from Ingo Molnar: "Misc fixes: liblockdep fixes and ww_mutex fixes"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/ww_mutex: Fix spelling mistake "cylic" -> "cyclic"
  locking/lockdep: Delete unnecessary #include
  tools/lib/lockdep: Add dummy task_struct state member
  tools/lib/lockdep: Add empty nmi.h
  tools/lib/lockdep: Update Sasha Levin email to MSFT
  jump_label: Fix typo in warning message
  locking/mutex: Fix mutex debug call and ww_mutex documentation
2018-09-13Merge tag 'printk-for-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printkLinus Torvalds1-7/+5
Pull printk fix from Petr Mladek: "Revert a commit that caused "quiet", "debug", and "loglevel" early parameters to be ignored for early boot messages"

* tag 'printk-for-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
  Revert "printk: make sure to print log on console."
2018-09-12bpf/verifier: disallow pointer subtractionAlexei Starovoitov1-1/+1
Subtraction of pointers was accidentally allowed for unpriv programs by commit 82abbf8d2fc4. Revert that part of the commit. Fixes: 82abbf8d2fc4 ("bpf: do not allow root to mangle valid pointers") Reported-by: Jann Horn <[email protected]> Acked-by: Daniel Borkmann <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
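The one-line fix presumably re-gates the pointer-subtraction special case in the verifier on the privilege check, along these lines (a sketch, not the exact diff):

	if (opcode == BPF_SUB && env->allow_ptr_leaks) {
		/* ptr - ptr yields an unknown scalar; privileged only */
		mark_reg_unknown(env, regs, insn->dst_reg);
		return 0;
	}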
2018-09-12bpf: btf: Fix end boundary calculation for type sectionMartin KaFai Lau1-1/+1
The end boundary math for the type section is incorrect in btf_check_all_metas(). It just happens that hdr->type_off is always 0 for now because there are only two sections (type and string) and the string section must be at the end (ensured in btf_parse_str_sec()). However, type_off may not be 0 if a new section is added later. This patch fixes it. Fixes: f80442a4cd18 ("bpf: btf: Change how section is supported in btf_header") Reported-by: Dmitry Vyukov <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Acked-by: Yonghong Song <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-09-11locking/lockdep, cpu/hotplug: Annotate AP threadPeter Zijlstra1-0/+28
Anybody trying to assert the cpu_hotplug_lock is held (lockdep_assert_cpus_held()) from AP callbacks will fail, because the lock is held by the BP. Stick in an explicit annotation in cpuhp_thread_fun() to make this work. Reported-by: Ingo Molnar <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Fixes: cb538267ea1e ("jump_label/lockdep: Assert we hold the hotplug lock for _cpuslocked() operations") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
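A sketch of the annotation; the lockdep_acquire_cpus_lock()/lockdep_release_cpus_lock() pair brackets the state-machine work on the AP (surrounding code elided):

	static void cpuhp_thread_fun(unsigned int cpu)
	{
		/* ... */
		/* the BP holds cpu_hotplug_lock; make lockdep see that on the AP too */
		lockdep_acquire_cpus_lock();

		/* ... run the hotplug state callbacks ... */

		lockdep_release_cpus_lock();
		/* ... */
	}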
2018-09-11Revert "printk: make sure to print log on console."Petr Mladek1-7/+5
This reverts commit 375899cddcbb26881b03cb3fbdcfd600e4e67f4a. The visibility of early messages no longer took into account the "quiet", "debug", and "loglevel" early parameters. It would be possible to invalidate and recompute the LOG_NOCONS flag for the affected messages, but it would be hairy. Instead this patch just reverts the problematic commit. We could come up with a better solution for the original problem. For example, we could simplify the logic and just mark messages that should always be visible or always invisible on the console. This patch also reverts the related build fix commit ffaa619af1b06 ("printk: Fix warning about unused suppress_message_printing"). Finally, this patch does not put back the unused LOG_NOCONS flag. Link: http://lkml.kernel.org/r/[email protected] Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "H . Peter Anvin" <[email protected]> Cc: [email protected] Cc: [email protected] Cc: Steven Rostedt <[email protected]> Cc: Maninder Singh <[email protected]> Reported-by: Hans de Goede <[email protected]> Acked-by: Hans de Goede <[email protected]> Acked-by: Sergey Senozhatsky <[email protected]> Signed-off-by: Petr Mladek <[email protected]>
2018-09-11locking/rtmutex: Fix the preprocessor logic with normal #ifdef #else #endifSteven Rostedt (VMware)1-2/+2
Merging v4.14.68 into v4.14-rt I tripped over a conflict in the rtmutex.c code. There I found that we had:

  #ifdef CONFIG_DEBUG_LOCK_ALLOC
  [..]
  #endif

  #ifndef CONFIG_DEBUG_LOCK_ALLOC
  [..]
  #endif

Really this should be:

  #ifdef CONFIG_DEBUG_LOCK_ALLOC
  [..]
  #else
  [..]
  #endif

This cleans up that logic. Signed-off-by: Steven Rostedt (VMware) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Rosin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-09-10perf/core: Force USER_DS when recording user stack dataYabin Cui1-0/+4
Perf can record user stack data in response to a synchronous request, such as a tracepoint firing. If this happens under set_fs(KERNEL_DS), then we end up reading user stack data using __copy_from_user_inatomic() under set_fs(KERNEL_DS). I think this conflicts with the intention of using set_fs(KERNEL_DS). And it is explicitly forbidden by hardware on ARM64 when both CONFIG_ARM64_UAO and CONFIG_ARM64_PAN are used. So fix this by forcing USER_DS when recording user stack data. Signed-off-by: Yabin Cui <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Fixes: 88b0193d9418 ("perf/callchain: Force USER_DS when invoking perf_callchain_user()") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
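A sketch of the fix in the user-stack dump path (the shape is inferred from the description; variable names are illustrative):

	mm_segment_t fs = get_fs();

	set_fs(USER_DS);	/* the stack dump reads user memory */
	dyn_size = dump_size - __output_copy_user(handle, (void *)sp, dump_size);
	set_fs(fs);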
2018-09-10locking/ww_mutex: Fix spelling mistake "cylic" -> "cyclic"Colin Ian King1-1/+1
Trivial fix to spelling mistake in pr_err() error message Signed-off-by: Colin Ian King <[email protected]> Acked-by: Will Deacon <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-09-10locking/lockdep: Delete unnecessary #includeBen Hutchings1-1/+0
Commit: c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and unify their usage") added the inclusion of <trace/events/preemptirq.h>. liblockdep doesn't have a stub version of that header so now fails to build. However, commit: bff1b208a5d1 ("tracing: Partial revert of "tracing: Centralize preemptirq tracepoints and unify their usage"") removed the use of functions declared in that header. So delete the #include. Signed-off-by: Ben Hutchings <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sasha Levin <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Fixes: bff1b208a5d1 ("tracing: Partial revert of "tracing: Centralize ...") Fixes: c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints ...") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-09-10locking/rwsem: Make owner store task pointer of last owning readerWaiman Long3-28/+76
Currently, when a reader acquires a lock, it only sets the RWSEM_READER_OWNED bit in the owner field. The other bits are simply not used. When debugging hanging cases involving rwsems and readers, the owner value does not provide much useful information at all. This patch modifies the current behavior to always store the task_struct pointer of the last rwsem-acquiring reader in a reader-owned rwsem. This may be useful in debugging rwsem hanging cases especially if only one reader is involved. However, the task in the owner field may not be the real owner or one of the real owners at all when the owner value is examined, for example, in a crash dump. So it is just an additional hint about the past history. If CONFIG_DEBUG_RWSEMS=y is enabled, the owner field will be checked at unlock time too to make sure the task pointer value is valid. That does have a slight performance cost and so is only enabled as part of that debug option. From the performance point of view, it is expected that the changes shouldn't have any noticeable performance impact. A rwsem microbenchmark (with 48 worker threads and a 1:1 reader/writer ratio) was run on a 2-socket 24-core 48-thread Haswell system. The locking rates on a 4.19-rc1 based kernel were as follows:

 1) Unpatched kernel:                        543.3 kops/s
 2) Patched kernel:                          549.2 kops/s
 3) Patched kernel (CONFIG_DEBUG_RWSEMS on): 546.6 kops/s

There was actually a slight increase in performance (1.1%) in this particular case. Maybe it was caused by the elimination of a branch or just testing noise. Turning on the CONFIG_DEBUG_RWSEMS option also had less than the expected impact on performance. The least significant 2 bits of the owner value are now used to designate that the rwsem is reader-owned and that the owners are anonymous. Signed-off-by: Waiman Long <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
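A sketch of the setter and the two low-bit flags (names follow the changelog's description; details may differ from the patch):

	#define RWSEM_READER_OWNED		(1UL << 0)
	#define RWSEM_ANONYMOUSLY_OWNED		(1UL << 1)

	static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
						    struct task_struct *owner)
	{
		unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED
						 | RWSEM_ANONYMOUSLY_OWNED;

		WRITE_ONCE(sem->owner, (struct task_struct *)val);
	}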
2018-09-10sched/fair: Fix kernel-doc notation warningRandy Dunlap1-0/+1
Fix kernel-doc warning for missing 'flags' parameter description: ../kernel/sched/fair.c:3371: warning: Function parameter or member 'flags' not described in 'attach_entity_load_avg' Signed-off-by: Randy Dunlap <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Fixes: ea14b57e8a18 ("sched/cpufreq: Provide migration hint") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-09-10locking/rwsem: Exit read lock slowpath if queue empty & no writerWaiman Long1-1/+12
It was discovered that a constant stream of readers with occasional writers pounding on a rwsem may cause many of the readers to enter the slowpath unnecessarily, thus increasing latency and lowering performance. In the current code, a reader entering the slowpath critical section will unconditionally set the WAITING_BIAS, if not set yet, and clear its active count even if no one is in the wait queue and no writer is present. This causes some incoming readers to observe the presence of waiters in the wait queue and hence have to go into the slowpath themselves. With sufficient numbers of readers and a relatively short lock hold time, the WAITING_BIAS may be repeatedly turned on and off, and a substantial portion of the readers will go into the slowpath, sustaining a rather long queue on the wait queue spinlock and a repeated WAITING_BIAS on/off cycle until the logjam is broken opportunistically. To avoid this situation from happening, an additional check is added to detect the special case that the reader in the critical section is the only one in the wait queue and no writer is present. When that happens, it can just exit the slowpath and return immediately, as its active count has already been set in the lock. Other incoming readers won't observe the presence of waiters and so will not be forced into the slowpath. The issue was found at a customer site where they had an application that pounded on the pread64 syscalls heavily on an XFS filesystem. The application was run on recent 4-socket boxes with a lot of CPUs. They saw significant spinlock contention in the rwsem_down_read_failed() call. With this patch applied, the system CPU usage went down from 85% to 57%, and the spinlock contention in the pread64 syscalls was gone. Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Davidlohr Bueso <[email protected]> Acked-by: Will Deacon <[email protected]> Cc: Joe Mario <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
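The added check is roughly of this shape, taken while holding sem->wait_lock before the reader queues itself (a sketch inferred from the description):

	if (atomic_long_read(&sem->count) >= 0 && list_empty(&sem->wait_list)) {
		/* no waiters, no writer: our active read bias already holds the lock */
		raw_spin_unlock_irq(&sem->wait_lock);
		return sem;
	}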
2018-09-10jump_label/lockdep: Assert we hold the hotplug lock for _cpuslocked() operationsPeter Zijlstra1-0/+5
Weirdly we seem to have forgotten this... Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
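The assertion amounts to the following in the _cpuslocked() helpers (sketch; function name illustrative of the pattern):

	static void static_key_slow_inc_cpuslocked(struct static_key *key)
	{
		lockdep_assert_cpus_held();
		/* ... */
	}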
2018-09-10Merge branch 'locking/urgent' into locking/core, to pick up fixesIngo Molnar2-3/+2
Signed-off-by: Ingo Molnar <[email protected]>
2018-09-10jump_label: Fix typo in warning messageBorislav Petkov1-1/+1
There's no 'allocatote' - use the next best thing: 'allocate' :-) Signed-off-by: Borislav Petkov <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Jason Baron <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt (VMware) <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-09-10sched/fair: Fix load_balance redo for !imbalanceVincent Guittot1-1/+1
It can happen that load_balance() finds a busiest group and then a busiest rq, but the calculated imbalance is in fact 0. In such a situation, detach_tasks() returns immediately and leaves the LBF_ALL_PINNED flag set. The busiest CPU is then wrongly assumed to have pinned tasks and is removed from the load balance mask. Then we redo the load balance without the busiest CPU. This creates a wrong load balance situation and generates wrong task migrations. If the calculated imbalance is 0, it's useless to try to find a busiest rq, as no task will be migrated and we can return immediately. This situation can happen on heterogeneous systems, or on SMP systems when RT tasks are decreasing the capacity of some CPUs. Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
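The fix presumably makes find_busiest_group() return no group when the computed imbalance is zero, along these lines (a sketch):

	force_balance:
		/* Looks like there is an imbalance. Compute it */
		calculate_imbalance(env, &sds);
		return env->imbalance ? sds.busiest : NULL;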
2018-09-10sched/fair: Fix scale_rt_capacity() for SMTVincent Guittot1-3/+3
Since commit: 523e979d3164 ("sched/core: Use PELT for scale_rt_capacity()") scale_rt_capacity() returns the remaining capacity and not a scale factor to apply to cpu_capacity_orig. arch_scale_cpu_capacity() is directly called by scale_rt_capacity(), so the latter must take the sched_domain argument and pass it through. Reported-by: Srikar Dronamraju <[email protected]> Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Srikar Dronamraju <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Fixes: 523e979d3164 ("sched/core: Use PELT for scale_rt_capacity()") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>